I want to set up my own Nepenthes against LLMs. I have purchased a domain, say “wowsocool.com”.
I have a Raspberry Pi 4B that I want to use as an nginx reverse proxy, and an old Acer laptop that will host Nepenthes. I am going to host this behind my current residence's router, as I won't be staying there too long; I thought this was a cool temporary project.
My problem is that the website sort of glosses over the whole nginx setup and IP pointing etc.
If anyone has done this before, could you please write up a dummy's guide that goes through everything? I am quite unconfident and my skills in this field are nonexistent.
Pretty please.
So, I set this up recently and agree with all of your points about the actual integration being glossed over.
I already had bot detection set up in my Nginx config, so adding Nepenthes was just a matter of changing the behavior of that. Previously, I had just returned either 404 or 444 to those requests, but now it redirects them to Nepenthes.
Rather than trying to do rewrites and pretend the Nepenthes content is under my app’s URL namespace, I just do a redirect which the bot crawlers tend to follow just fine.
There are several parts to this, to keep my config sane. Each of them lives in an include file:
- An include file that looks at the user agent, compares it to a list of bot UA regexes, and sets a variable to either 0 or 1. By itself, that include file doesn't do anything more than set that variable. This allows me to have it as a global config without having it apply to every virtual host.
- An include file that performs the action if the variable is set to true. This has to be included in the `server` portion of each virtual host where I want the bot traffic to go to Nepenthes. If this isn't included in a virtual host's `server` block, then bot traffic is allowed.
- A virtual host where the Nepenthes content is presented. I run a subdomain (`content.mydomain.xyz`). You could also do this as a path off of your protected domain, but this works for me and keeps my already complex config from getting any worse. Plus, it was easier to integrate into my existing bot config. Had I not already had that, I would have run it off of a path (and may go back and do that when I have time to mess with it again).
The `map-bot-user-agents.conf` is included in the `http` section of Nginx and applies to all virtual hosts. You can either include this in the main `nginx.conf` or at the top (above the `server` section) in your individual virtual host config file(s).

The `deny-disallowed.conf` is included individually in each virtual host's `server` section. Even though the bot detection is global, if the virtual host's `server` section does not include the action file, then nothing is done.

Files
map-bot-user-agents.conf
Note that I’m treating Google’s crawler the same as an AI bot because…well, it is. They’re abusing their search position by double-dipping on the crawler so you can’t opt out of being crawled for AI training without also preventing it from crawling you for search engine indexing. Depending on your needs, you may need to comment that out. I’ve also commented out the Python requests user agent. And forgive the mess at the bottom of the file. I inherited the seed list of user agents and haven’t cleaned up that massive regex one-liner.
```
# Map bot user agents
#
# Sets the $ua_disallowed variable to 0 or 1 depending on the user agent.
# Non-bot UAs are 0, bots are 1
map $http_user_agent $ua_disallowed {
    default 0;
    "~PerplexityBot" 1;
    "~PetalBot" 1;
    "~applebot" 1;
    "~compatible; zot" 1;
    "~Meta" 1;
    "~SurdotlyBot" 1;
    "~zgrab" 1;
    "~OAI-SearchBot" 1;
    "~Protopage" 1;
    "~Google-Test" 1;
    "~BacklinksExtendedBot" 1;
    "~microsoft-for-startups" 1;
    "~CCBot" 1;
    "~ClaudeBot" 1;
    "~VelenPublicWebCrawler" 1;
    "~WellKnownBot" 1;
    #"~python-requests" 1;
    "~bitdiscovery" 1;
    "~bingbot" 1;
    "~SemrushBot" 1;
    "~Bytespider" 1;
    "~AhrefsBot" 1;
    "~AwarioBot" 1;
    # "~Poduptime" 1;
    "~GPTBot" 1;
    "~DotBot" 1;
    "~ImagesiftBot" 1;
    "~Amazonbot" 1;
    "~GuzzleHttp" 1;
    "~DataForSeoBot" 1;
    "~StractBot" 1;
    "~Googlebot" 1;
    "~Barkrowler" 1;
    "~SeznamBot" 1;
    "~FriendlyCrawler" 1;
    "~facebookexternalhit" 1;
    "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfly|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTTrack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErsPRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^QueryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanner|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^VoidEYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;
}
```

deny-disallowed.conf
```
# Deny disallowed user agents
if ($ua_disallowed) {
    # This redirects them to the Nepenthes domain. So far, pretty much all the
    # bot crawlers have been happy to accept the redirect and crawl the tarpit
    # continuously
    return 301 https://content.mydomain.xyz/;
}
```
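Since people asked how the pieces fit together, here's a trimmed-down sketch of a full config wiring these files up. To be clear, this is not my actual config: the snippet paths, the app upstream, and the Nepenthes port are placeholder assumptions (Nepenthes listens on whatever port you configured it with), so adjust to your own layout.

```
# /etc/nginx/nginx.conf -- illustrative sketch only; paths/ports are placeholders
events {}

http {
    # Global: only sets the $ua_disallowed variable, takes no action by itself
    include /etc/nginx/snippets/map-bot-user-agents.conf;

    # A protected virtual host
    server {
        listen 80;
        server_name mydomain.xyz;

        # Per-vhost opt-in: leave this include out and bot traffic is allowed
        include /etc/nginx/snippets/deny-disallowed.conf;

        location / {
            proxy_pass http://127.0.0.1:3000;  # your real app
        }
    }

    # The tarpit virtual host that the 301 above redirects to
    server {
        listen 80;
        server_name content.mydomain.xyz;

        location / {
            # Placeholder upstream -- point at wherever Nepenthes is listening
            proxy_pass http://127.0.0.1:8893;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}
```

The key point is that the map include lives at the `http` level, while the deny include is opted into per `server` block.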
Ironically, an LLM could generate your nginx config or the guide you’ve requested
And if you try often enough, it may even be a working one…
Nah, they suck for programming or anything involving imperative logic, but they are pretty decent with things that are declarative, like config. I know people want to hate or deny any usefulness of LLMs, and it doesn't help that corpos insist on cramming LLMs into use cases that aren't applicable to LLMs at all, but this is actually one of the things they are good at.
But still, how would you verify whether the config is good or not? For example, if it exposes root?
You could (should?) run it on a test server/VPS before committing anything to production. I have a little VPS set up just for this purpose. Spin something up on it and observe.
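For this setup specifically, a quick smoke test might look something like the following (the hostname is a placeholder, and `GPTBot` is just one of the UAs from the map file upthread):

```
# Check the config parses before reloading nginx
sudo nginx -t

# Pretend to be a bot and confirm the 301 to the tarpit fires
curl -sI -A "GPTBot" http://mydomain.xyz/ | head -n 3

# Sanity check: a normal browser UA should not be redirected
curl -sI -A "Mozilla/5.0" http://mydomain.xyz/ | head -n 3
```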
Yeah, I'm not saying it's perfect, and LLMs are non-deterministic, so it could give you some crap. You're not wrong, and it's good to be aware of that. How do you verify that some random stranger from the internet wasn't an asshole and gave you a malicious config? 🤷 The best answer is probably just that OP should heed the warning on the website they linked, if they have no confidence or relevant skills:
> THIS IS DELIBERATELY MALICIOUS SOFTWARE INTENDED TO CAUSE HARMFUL ACTIVITY. DO NOT DEPLOY IF YOU AREN'T FULLY COMFORTABLE WITH WHAT YOU ARE DOING.
I pasted the OP unmodified into a local LLM and it gave me this:
Paste this (replace 192.168.1.105 with your Acer's local IP from Part 1.3):

```
server {
    listen 80;
    server_name wowsocool.com www.wowsocool.com;

    location / {
        proxy_pass http://192.168.1.105:8000/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

along with correct instructions on finding the IP of the laptop, port forwarding, and examples of how to set up DDNS for several popular providers. The only things I can see that are wrong are that the port should be 8893 instead of 8000, and they may want to proxy a different path to Nepenthes than `/`.
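With those fixes, the generated location block would presumably end up something like this (`/tarpit/` is just an example path; pick whatever you want to sacrifice to Nepenthes):

```
location /tarpit/ {
    # 8893 being the port Nepenthes actually listens on in this setup
    proxy_pass http://192.168.1.105:8893/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}
```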
> and it doesn't help that corpos insist on cramming LLMs into use cases that aren't applicable to LLMs at all
I am reminded of how, back in the late '60s and '70s, we did a lot of studies on left-handed people and our kneejerk reaction of trying to change their dominant hand. We decided that left-handed people were absolutely normal: leave them be and stop stressing out adolescents by trying to make them 'normal', because they already are. BTW, the practice of changing dominant hands goes all the way back to the Catholic church in the middle ages. Anyways, when corporate America heard the news, they started producing all manner of left-handed tools, which was helpful, but their motivation was $$. Same with LGBTQ+++. Corporate America capitalized on every aspect.
However, if you plunk down your hard-earned money for an AI rice cooker, you're the idiot, and P.T. Barnum would be right once again.
I think the feeling is the same, but the cause is a bit different. It is more similar to the dot-com bubble, where investors (for some reason?) are hyped to throw their money into AI. So if you can market yourself as AI, you can get big investments. Now that you have all that investor cash, you need to justify it somehow by using AI somewhere, anywhere.
I've dealt with VCs back when I ran an internet radio station. There is pressure to incorporate their wishes because, well, they want an ROI on their investment.