Incoherent rant.

I’ve, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing. Multiple IPs all over the US.

So I’ve decided to do some restructuring of how I run things. Ditched Fedora on my VPS in favour of Alpine, just to start with a clean slate. And started looking into different options on how to combat things better.

Behold, Anubis.

“Weighs the soul of incoming HTTP requests to stop AI crawlers”

As I understand it, it works like a reverse proxy in front of each service. It took me a while to actually understand how it’s supposed to integrate, but once I figured it out all bot activity instantly stopped. Not a single one has gotten through yet.

My setup is basically just a home server -> Tailscale tunnel (not Funnel) -> VPS -> Caddy reverse proxy, now with Anubis integrated.

I’m not really sure why I’m posting this, but I hope at least one other goober trying to find a possible solution to these things finds this post.

Anubis GitHub, Anubis website

Edit: Further elaboration for those who care, since I realized that might be important.

  • You don’t have to use caddy/nginx/whatever as your reverse proxy in the first place, it’s just how my setup works.
  • My Anubis instance runs inside the same Docker Compose stack as the Caddy reverse proxy, and sits between Caddy and my local server. So when a request comes in, Caddy forwards it to Anubis as configured in its Caddyfile, and Anubis decides whether to pass the request on to the service or stop it in its tracks.
  • There are some minor issues, like it requiring JavaScript to be enabled, which might get a bit annoying for NoScript/Librewolf/whatever users, but considering most crawlbots don’t run JS at all, I believe this is a great tradeoff.
  • The most confusing part was the docs, and understanding what it’s supposed to do in the first place.
  • There’s an option to apply your own rules via json/yaml, but I haven’t figured out how to do that properly in docker yet. As in, there’s a main configuration file you can override, but there’s apparently also a way to add additional bots to block in separate files in a subdirectory. I’m sure I’ll figure that out eventually.

Cheers and I really hope someone finds this as useful as I did.

  • Daniel Quinn@lemmy.ca (15 hours ago, 6 upvotes)

    This all appears to be based on the user agent, so wouldn’t that mean that bad-faith scrapers could just send a typical search engine user agent and get through?

    • SorteKanin@feddit.dk (6 hours ago, 3 upvotes)

      Most search engines publish a list of verified IP addresses that their bots crawl from, so you could check the IP of anything claiming to be a search bot against that list to confirm it’s genuine.
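
To make the idea concrete, here is a rough sketch of that check in Python. It is not how Anubis does it, just an illustration: the bot name `examplebot` and the address ranges are made-up placeholders (reserved documentation ranges), and in practice you would load the ranges each search engine actually publishes.

```python
import ipaddress

# Hypothetical data: "examplebot" is not a real crawler, and these are
# reserved documentation ranges standing in for a published IP list.
PUBLISHED_RANGES = {
    "examplebot": ["192.0.2.0/24", "2001:db8::/32"],
}


def claimed_bot(user_agent):
    """Return the known bot name the User-Agent claims to be, or None."""
    ua = user_agent.lower()
    for name in PUBLISHED_RANGES:
        if name in ua:
            return name
    return None


def is_verified_crawler(user_agent, remote_ip):
    """True only if the claimed bot's published ranges contain remote_ip."""
    name = claimed_bot(user_agent)
    if name is None:
        return False  # does not claim to be a bot we know about
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # unparseable address, treat as unverified
    return any(
        addr in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES[name]
    )
```

A scraper that only spoofs the user agent fails this check, because its source address is not in the published ranges: `is_verified_crawler("ExampleBot/1.0", "203.0.113.5")` comes back false, while the same user agent from `192.0.2.10` passes.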
