Incoherent rant.

I’ve, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing. Multiple IPs all over the US.

So I’ve decided to restructure how I run things. I ditched Fedora on my VPS in favour of Alpine, just to start with a clean slate, and started looking into better ways to fight off these crawlers.

Behold, Anubis.

“Weighs the soul of incoming HTTP requests to stop AI crawlers”

From how I understand it, it works like a reverse proxy in front of each service. It took me a while to actually understand how it’s supposed to integrate, but once I figured it out, all bot activity instantly stopped. Not a single one has gotten through yet.

My setup is basically just a home server -> tailscale tunnel (not funnel) -> VPS -> caddy reverse proxy, now with anubis integrated.
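
To give a rough idea of how that last hop works, here is a minimal sketch of the two pieces involved (hostnames, ports and the compose layout are placeholders rather than my real config, so check the Anubis docs before copying): Caddy hands every request to the Anubis container, and Anubis forwards the legitimate ones to the actual service via its TARGET setting.

    # Caddyfile (sketch): Caddy terminates TLS and passes everything to Anubis
    lemmy.example.com {
        reverse_proxy anubis:8923
    }

    # docker-compose.yml (sketch): Anubis sits in front of lemmy-ui
    services:
      anubis:
        image: ghcr.io/techarohq/anubis:latest
        environment:
          BIND: ":8923"                  # where Anubis listens for Caddy
          TARGET: "http://lemmy-ui:1234" # the service Anubis protects
          DIFFICULTY: "4"                # proof-of-work difficulty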

I’m not really sure why I’m posting this, but I hope at least one other goober trying to find a possible solution to these things finds this post.

Anubis GitHub, Anubis website

Edit: Further elaboration for those who care, since I realized that might be important.

  • You don’t have to use caddy/nginx/whatever as your reverse proxy in the first place, it’s just how my setup works.
  • My Anubis sits between my local server and Caddy, inside the Caddy reverse proxy Docker Compose stack. So when a request comes in, Caddy hands it off to Anubis via its Caddyfile, and Anubis decides whether to forward the request to the service or stop it in its tracks (roughly the flow in the sketch above).
  • There are some minor issues, like it requiring JavaScript to be enabled, which might get a bit annoying for NoScript/Librewolf/whatever users, but considering most crawlbots don’t run JS at all, I believe this is a great tradeoff.
  • The most confusing part was the docs and understanding what it’s supposed to do in the first place.
  • There’s an option to apply your own rules via JSON/YAML, but I haven’t figured out how to do that properly in Docker yet. As in, there’s a main configuration file you can override, but there’s apparently also a way to add additional bots to block in separate files in a subdirectory. I’m sure I’ll figure that out eventually (there’s a rough sketch of what such a file might look like right below).
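
For the curious, here is roughly what a custom policy file might look like, going off my reading of the docs (the bot names and regexes below are just examples, and the exact schema and mount path should be verified upstream before copying):

    {
      "bots": [
        {
          "name": "amazonbot",
          "user_agent_regex": "Amazonbot",
          "action": "DENY"
        },
        {
          "name": "generic-browser",
          "user_agent_regex": "Mozilla",
          "action": "CHALLENGE"
        }
      ]
    }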

Cheers and I really hope someone finds this as useful as I did.

  • Mora@pawb.social · 10 hours ago

    Besides that point: why tf do they even crawl Lemmy? They could just as well create a “read only” instance with an account that subscribes to all communities … and the other instances would send them their data. Oh, right, AI has to be as unethical as possible for most companies for some reason.

    • wizardbeard@lemmy.dbzer0.com · 10 hours ago

      They crawl Wikipedia too, adding significant extra load to its servers, even though Wikipedia has a regularly updated torrent to download all of its content.

    • dan@upvote.au · 8 hours ago

      They’re likely not intentionally crawling Lemmy. They’re probably just crawling all sites they can find.

    • ZombiFrancis@sh.itjust.works · 10 hours ago

      See, your brain went immediately to a solution based on knowing how something works. That’s not in the AI wheelhouse.

    • AmbitiousProcess@piefed.social · 8 hours ago

      Because the easiest solution for them is a simple web scraper. If they don’t give a shit about ethics, then something that just crawls every page it can find is loads easier to set up than a custom implementation to fetch torrent downloads for Wikipedia, spin up lemmy/mastodon/pixelfed instances for the fediverse, use RSS feeds and check whether they have full or only partial articles, implement proper checks to prevent downloading the same content twice (or more), etc.

  • blob42@lemmy.ml · 7 hours ago

    I am planning to try it out, but for Caddy users, here is a solution I came up with that works, after being bombarded by AI crawlers for weeks.

    It is a custom Caddy CEL expression filter coupled with caddy-ratelimit and caddy-defender.

    Now here’s the fun part: the defender plugin can serve garbage as the response, so when an AI crawler matches, it poisons their training dataset.

    Originally I relied only on the rate limiter, and noticed that the AI bots kept trying again whenever the limit reset. Once I introduced data poisoning, they all stopped :)

    git.blob42.xyz {
        # Match requests whose Accept-Language is exactly 'zh-CN', or common bot/crawler User-Agents
        @bot <<CEL
            header({'Accept-Language': 'zh-CN'}) || header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
        CEL

        # Drop matched bots outright
        abort @bot

        # Serve garbage to known AI/cloud IP ranges to poison their training data
        defender garbage {
            ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
        }

        # Rate-limit everyone else per client IP
        rate_limit {
            zone dynamic_botstop {
                match {
                    method GET
                    # to use with defender
                    #header X-RateLimit-Apply true
                    #not header LetMeThrough 1
                }
                key {remote_ip}
                events 1500
                window 30s
                #events 10
                #window 1m
            }
        }

        reverse_proxy upstream.server:4242

        handle_errors 429 {
            respond "429: Rate limit exceeded."
        }
    }
    

    If I’m not mistaken, the 47.0.0.0/8 IP block belongs to Alibaba Cloud.

  • e0qdk@reddthat.com · 14 hours ago

    I don’t like Anubis because it requires me to enable JS, which makes me less secure. When we were getting hammered by scrapers, reddthat started using go-away as an alternative that doesn’t require JS.

    • BakedCatboy@lemmy.ml · 13 hours ago

      FWIW, Anubis is adding a no-JS meta-refresh challenge that, if it doesn’t run into issues, will soon become the new default challenge.
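
      Roughly, the idea behind that kind of challenge (my own sketch, not Anubis’s actual markup) is an interstitial page that any real browser clears on its own, while a scraper that ignores cookies and meta refresh gets stuck:

      <!-- the server sets a (signed) cookie on this response, then bounces
           the browser back to the page it originally asked for -->
      <html>
        <head><meta http-equiv="refresh" content="1; url=/original/page"></head>
        <body>Checking your browser…</body>
      </html>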

      • dan@upvote.au · 8 hours ago

        Won’t the bots just switch to using that instead of the heavier JS challenge?

        • Sekoia@lemmy.blahaj.zone · 8 hours ago

          They can, but it’s not trivial. The challenge uses a bunch of modern browser features that these scrapers don’t use, regarding metadata and compression and a few other things. Things that are annoying to implement and not worth the effort. Check the recent discussion on lobste.rs if you’re interested in the exact details.

  • dan@upvote.au · 8 hours ago

    The Anubis site thinks my phone is a bot :/

    tbh I would have just configured a reasonable rate limit in Nginx and left it at that.
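
    Something along these lines (zone name and numbers made up on the spot, not a tested config):

    # in the http {} block: track clients by IP, roughly 10 requests/second each
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    # in the server/location block: allow short bursts, answer the rest with 429
    limit_req zone=perip burst=20 nodelay;
    limit_req_status 429;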

    Won’t the bots just hammer the API instead now?

    • Flipper@feddit.org · 6 hours ago

      No. Rate limits don’t work because they crawl from huge IP spaces. Each IP alone isn’t bad; they just use several thousand of them.

      Using the API would require some basic changes on their side. We don’t do that here. If they wanted that, they could run their own instance and would even get notified about changes, with no crawling required at all.

    • dan@upvote.au · 8 hours ago

      tbh I kinda understand their viewpoint. Not saying I agree with it.

      The Anubis JavaScript program’s calculations are the same kind of calculations done by crypto-currency mining programs. A program which does calculations that a user does not want done is a form of malware.

      • interdimensionalmeme@lemmy.ml · 4 hours ago

        Requiring clients to run client-side code, if tolerated, will lead to the extinction of pure HTTP clients. That in turn will enable DRM across the whole web. I’d rather see it all burn.

      • Natanox@discuss.tchncs.de · 8 hours ago

        That’s guilt by association. Their viewpoint is awful.

        I also wish there were no security at concert gates, but I happily accept it if that means actual security (if done reasonably, of course). And quite frankly, a cute anime girl doing some math is so, so much better than those god damn freaking captchas. Or the service literally dying due to AI DDoS.

        Edit: Forgot to mention, proof of work wasn’t invented by or for cryptocurrency or blockchains. The concept has been around since the ’90s (as an idea for email spam prevention), making their argument completely nonsensical.

    • chihuamaranian@lemmy.ca · 8 hours ago

      The FSF explanation of why they dislike Anubis could just as easily apply to the process of decrypting TLS/HTTPS. You know, something uncontroversial that every computer is expected to do when they want to communicate securely.

      I don’t fundamentally see the difference between “The computer does math to ensure end-to-end privacy” and “The computer does math to mitigate DDoS attempts on the server”. Either way, without such protections the client/server relationship is lacking crucial fundamentals that many interactions depend on.

      • rtxn@lemmy.world · 7 hours ago

        I’ve made that exact comparison before. TLS uses encryption; ransomware also uses encryption; by their logic, serving web content through HTTPS with no way to bypass it is a form of malware. The same goes for injecting their donation banner using an iframe.

  • NotSteve_@piefed.ca · 8 hours ago

    I love Anubis just because the dev is from my city that never gets talked about (Ottawa).

    • SheeEttin@lemmy.zip · 8 hours ago

      Well, not never; you’ve got the Senators.

      Which will never not be funny to me since it’s Latin for “old men”.

      • NotSteve_@piefed.ca · 7 hours ago

        Hahaha, I didn’t know that, but that is funny. Admittedly I’m not too big into hockey, so I’ve got no gauge on how popular (edit: or unpopular 😅) the Sens are.

  • TomAwezome@lemmy.world · 10 hours ago

    Thanks for the “incoherent rant”, I’m setting some stuff up with Anubis and Caddy so hearing your story was very welcome :)

  • Possibly linux@lemmy.zip · 14 hours ago

    It doesn’t stop bots.

    All it does is make clients do as much work as the server or more, which makes it less tempting to hammer the web.
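
    The rough shape of it, sketched in Python rather than the browser-side JS Anubis actually ships: the client has to brute-force a nonce, while the server can check the answer with a single hash.

    import hashlib

    def solve(challenge: str, difficulty: int) -> int:
        """Brute-force a nonce whose SHA-256 hex digest starts with `difficulty` zeros."""
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
            if digest.startswith("0" * difficulty):
                return nonce  # expensive to find on the client...
            nonce += 1

    def verify(challenge: str, nonce: int, difficulty: int) -> bool:
        # ...but a single hash for the server to check
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        return digest.startswith("0" * difficulty)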

    • zoey@lemmy.librebun.com (OP) · 14 hours ago

      Yeah, from what I understand it’s nothing crazy for any regular client, but it really messes with the bots.
      I don’t know, I’m just so glad and happy it works: it doesn’t mess with federation and it’s barely visible when accessing the sites.

      • Possibly linux@lemmy.zip · 13 hours ago

        Personally, my only real complaint is the lack of WASM. Outside of that it works fairly well.

  • Daniel Quinn@lemmy.ca · 14 hours ago

    I’ve been thinking about setting up Anubis to protect my blog from AI scrapers, but I’m not clear on whether this would also block search engines. It would, wouldn’t it?

    • AmbitiousProcess@piefed.social · 8 hours ago

      Could you elaborate on how it’s ableist?

      As far as I’m aware, not only are they making a version that doesn’t require JS at all, but the JS is only needed for the challenge itself; the browser can then view the page(s) afterwards entirely without JS being needed to parse the content in any way. Things like screen readers should still do perfectly fine at parsing content after the browser solves the challenge.