A project called Poison Fountain is asking website operators to feed poisoned data to LLM crawlers.

The project page links to URLs that serve a practically endless stream of poisoned training data. The project’s creators have determined that this approach is very effective at sabotaging the quality and accuracy of any AI trained on it.

Small quantities of poisoned training data can significantly damage a language model.

The page also gives suggestions on how to put the provided resources to use.

  • eru@mouse.chitanda.moe · +4 · 4 hours ago

    I would imagine companies would just filter it out.

    You’d need some more clever way of hiding it, or to allow it to be self-hosted so that it appears under various URLs.

    • GamingChairModel@lemmy.world · +3 · 4 hours ago

      If I am reading this correctly, anyone who wants to use this service can configure their HTTP server to act as a man in the middle: the crawler requests your URL, but your server fetches the content from the Poison Fountain service and relays it.

      If so, the crawlers wouldn’t be able to filter by URL, because the requests they make only ever name your domain; the canonical URL of the Poison Fountain never appears to them.

      In other words, the poison is “self-hosted” at your own URL, while the stream itself comes from a URL the crawler never sees.
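
      Something like this in nginx, I’d guess; a minimal sketch where the path and the upstream address are placeholders, not the project’s real endpoint:

        # Crawlers request a URL on your domain; nginx silently
        # relays the stream from the hidden poison upstream.
        location /archive/ {
            proxy_pass https://poison-fountain.example/stream/;
            proxy_set_header Host poison-fountain.example;
        }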

  • termaxima@slrpnk.net · +32 · 11 hours ago

    Been thinking about making one of these too, especially since I have a catchy name: asbestos.

  • vacuumflower@lemmy.sdf.org · +17/-9 · 14 hours ago

    Suppose I were optimistic about this technology but pessimistic about its current stage of development; then I’d expect this to act as a cure. It’s a problem they’ll have to solve. A test they’ll have to pass.

    If somewhere inside those things someone builds a mechanism that constructs a graph of syllogisms, no kind of poisoned input data will be able to hurt them.

    So: this is a good thing, but when people say it’s a rebellion, it’s not.

    • kadu@scribe.disroot.org · +8 · 6 hours ago

      Samsung and Anthropic have independently published data showing how little bad data it takes to effectively poison very large models. LLMs pretend to be complex, but they aren’t, and they won’t continue to improve at the initial rate we got used to seeing. Just ask OpenAI.

      • vacuumflower@lemmy.sdf.org · +2 · 4 hours ago

        I’m not talking about LLMs. I’m talking about future developments that learn from LLMs; eventually there will have to be some resolution of conflicting knowledge and of logical connections, otherwise they won’t become remotely as useful as advertised.

    • FlashMobOfOne@lemmy.world · +15 · 8 hours ago

      A test they’ll have to pass.

      This makes me chuckle, as they invented euphemisms like ‘hallucinations’ because their LLMs can’t do what they promise. Fabulous marketing, but clearly they didn’t do enough testing.

      • vacuumflower@lemmy.sdf.org · +1 · 4 hours ago

        What I said, in other words, is that it doesn’t matter what they do until this problem is solved. So if this is described as some sort of rebellion against AI (or “AI”), then no, it isn’t; and at the point where the technology becomes dangerous in itself, and not just for the economy, it still won’t be one.

      • Bazoogle@lemmy.world · +1 · 5 hours ago

        as they invented euphemisms like ‘hallucinations’

        Seems like a pretty accurate word to use, no? You could also use ‘fabrication’, ‘concoction’, ‘phantom’, or something else. I think ‘lie’ and its synonyms are not accurate, since lying requires intent. Since the LLM does not have intent, it cannot ‘lie’.

        • GamingChairModel@lemmy.world · +2 · 4 hours ago

          That’s why “bullshit,” as defined by Harry Frankfurt, is so useful for describing LLMs.

          A lie is a false statement that the speaker knows to be false. But bullshit is a statement made by a speaker who doesn’t care if it’s true or false.

    • RobotsLeftHand@lemmy.world · +3 · 5 hours ago

      “You’re not opposing me. All you’ve done is create a problem that will stop me until I have it figured out.” describes every struggle between opposing forces, so it’s interesting that you disagree with it.

      • vacuumflower@lemmy.sdf.org · +2 · 4 hours ago

        Not really; more like “if I can find a key to the door, I can open it, so engraving a fixed combination for the door lock on the same key doesn’t change much”.

        Poisoned data is still fundamentally valid data. Concepts of logical connectivity, and of statements being true or false, are what’s needed to make use of it.

    • Disillusionist@piefed.world (OP) · +19/-1 · 14 hours ago

      Not all problems may be cured immediately. Battles are rarely won with a single attack. A good thing is not the same as nothing.

  • chunes@lemmy.world · +18/-17 · 15 hours ago

    Small quantities of poisoned training data can significantly damage a language model.

    Source: trust me bro.

    Nightshade tried the same thing and it never worked.

  • BigBolillo@mgtowlemmy.org · +3/-45 · edited · 4 hours ago

    Seems like a bad take from my POV. As someone who uses and has made money using LLMs, I don’t feel it’s OK to poison them; I wouldn’t feel OK with myself getting something for free, and even earning money with it, while poisoning it at the same time. So my take is: you can always block crawlers in your nginx.conf with some extra steps, and you can even use an LLM to write and refine the rules until they block all the major crawlers. IMHO, if data is public, it’s public for crawlers too; it’s up to you whether you set up a block against them on your own behalf.
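
    For example, something along these lines; the user-agent names here are real crawlers, but the list is just a sample (real blocklists are much longer, and user agents can be spoofed):

      # Refuse requests whose User-Agent matches known AI crawlers
      if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|Bytespider)") {
          return 403;
      }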

    • Taldan@lemmy.world · +4 · 5 hours ago

      As someone who makes and uses software, I feel it is not okay to steal source code. I wouldn’t feel okay with myself getting something for free when it’s based on the stolen work of tens of thousands of people.

      AI companies aren’t respecting crawler blocking. They’re actively working to ensure their crawlers bypass any anti-crawler protections.

      As a side note, these efforts help AI in the long term. If we can poison LLMs, then you can guarantee a state actor can as well. AI needs to be able to weather training-data attacks; otherwise it becomes an easily manipulated propaganda tool.

    • RalfWausE@feddit.org · +25/-3 · 14 hours ago

      What about the following take: LLMs are an abomination that consumes enormous amounts of resources for… well… really nothing, besides being a tool to further enshittify the Internet and the world as a whole, a tool that makes it ever easier to create divisive content (not to mention the special content Grok is now known for), killing jobs and replacing genuine human creativity with a cheap, warped imitation of it.

      My opinion: everybody who uses or promotes this technology is an accomplice in making the world a worse place.

    • hector@lemmy.today · +21/-1 · edited · 14 hours ago

      So it wouldn’t be fair to prevent AI from violating every single copyright on Earth? That is a novel take.

      Especially as most people do not use AI, yet companies are trying to force it on them, ultimately to replace half the workforce and send the economy into a doom spiral.

    • Disillusionist@piefed.world (OP) · +18 · 15 hours ago

      “Public” is a tricky term. At this point everything is being treated as public by LLM developers. Maybe not you specifically, but a lot of people aren’t happy with how their data is being used to train AI.

      • Señor Mono@feddit.org · +10 · 14 hours ago

        Also, they always come up with new ways to circumvent blocking mechanisms, pushing extra work onto admins.

        Remember how judges ruled when somebody circumvented copy restrictions on media?

  • FaceDeer@fedia.io · +9/-13 · 15 hours ago

    Doesn’t work, but if it makes people feel better, I suppose they can waste their resources doing this.

    Modern LLMs aren’t trained on just whatever raw data can be scraped off the web any more. They’re trained on synthetic data that’s prepared by other LLMs and carefully crafted and curated. Folks here are still assuming GPT-3 is state of the art.

    • Taldan@lemmy.world · +3 · 5 hours ago

      Let’s say I believe you. If that’s the case, why are AI companies still scraping everything?

      • FaceDeer@fedia.io · +2 · 4 hours ago

        Raw materials to inform the LLMs constructing the synthetic data, most likely. If you want it to be up to date on the news, you need to give it that news.

        The point is not that the scraping doesn’t happen, it’s that the data is already being highly processed and filtered before it gets to the LLM training step. There’s a ton of “poison” in that data naturally already. Early LLMs like GPT-3 just swallowed the poison and muddled on, but researchers have learned how much better LLMs can be when trained on cleaner data and so they already take steps to clean it up.

    • KeenFlame@feddit.nu · +1 · 3 hours ago

      AI devalues datasets as it refines them; a lot of resources are aimed at solving the degradation that occurs when AI trains on AI. Gradients become poor, and quality follows.

      • FaceDeer@fedia.io · +1 · 3 hours ago

        You’re thinking of “model decay”, I take it? That’s not really a thing in practice.

    • Disillusionist@piefed.world (OP) · +12 · 15 hours ago

      From what I’ve heard, the influx of AI data is one of the reasons actual human data is becoming increasingly sought after. AI training AI has the potential to become a sort of digital inbreeding that suffers in areas like originality and other ineffable human qualities that AI still hasn’t quite mastered.

      I’ve also heard that this particular approach to poisoning AI is newer and thought to be quite effective, though I can’t personally speak to its efficacy.

    • XLE@piefed.social · +3 · 11 hours ago

      Do you have any basis for this assumption, FaceDeer?

      Based on your pro-AI-leaning comments in this thread, I don’t think people should accept defeatist rhetoric at face value.

      • FaceDeer@fedia.io · +2 · 9 hours ago

        A basic Google search for “synthetic data llm training” will give you lots of hits describing how the process goes these days.

        Take this as “defeatist” if you wish; as I said, it doesn’t really matter. In the early days of LLMs, when ChatGPT first came out, the strategy was to dump as much raw data into training as possible and hope quantity let the LLM figure something out. Since then it’s been learned that quality beats quantity, so training data is far more carefully curated these days. Not because there’s “poison” in it, just because curation results in better LLMs. Filtering out poison will happen as a side effect.

        It’s like trying to contaminate a city’s water supply by peeing in the river upstream of the water treatment plant drawing from it. The water treatment plant is already dealing with all sorts of contaminants anyway.

        • FauxLiving@lemmy.world · +1 · 6 hours ago

          That might be an argument if only large companies existed and they only trained foundation models.

          Scraped data is most often used for fine-tuning models for specific tasks, for example mimicking people on social media to push an ad or political agenda. A foundation model that speaks like it was trained on a textbook doesn’t work for synthesizing social media comments.

          In order to sound like a Lemmy user, you need to train on data that contains the idioms, memes and conversational styles used in the Lemmy community. That can’t be created from the output of other models, it has to come from scraping.

          Poisoning the data going to the scrapers will either kill the model during training or force everyone to pre-process their data, which increases the costs and expertise required to attempt such things.

          • FaceDeer@fedia.io · +1 · 5 hours ago

            Are you proposing flooding the Fediverse with fake bot comments in order to prevent the Fediverse from being flooded with fake bot comments? Or are you thinking more along the lines of that guy who keeps using “Þ” in place of “th”? Making the Fediverse too annoying to use for bot and human alike would be a fairly Pyrrhic victory, I would think.

            • FauxLiving@lemmy.world · +1 · 5 hours ago

              I am proposing neither of those things.

              The way to use this effectively is to detect scraping through established means and then, instead of banning the scraper, alter your output to feed it poisoned data instead of (or in addition to) the real content.

              Banning a target tells them when they were detected and lets them alter their profile to avoid it. If they’re never banned, they lose that information, and they also have to deploy additional resources to try to detect and remove the poisoned data.

              Either way, it causes the adversary to spend a lot of resources at very little cost to you.
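
              A rough nginx sketch of the idea; the user-agent match is a naive stand-in for whatever detection signal you actually trust, and the local poison generator address is hypothetical:

                # http level: tag suspected scrapers
                map $http_user_agent $poison {
                    default               0;
                    "~*(GPTBot|CCBot)"    1;
                }

                # in the server block: tagged clients are proxied to a
                # poison generator instead of being banned
                location / {
                    if ($poison) {
                        proxy_pass http://127.0.0.1:8100;
                    }
                    try_files $uri $uri/ =404;
                }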

              • FaceDeer@fedia.io · +1 · 5 hours ago

                I have no idea what “established means” would be. In the particular case of the Fediverse it seems impossible; you can just set up your own instance specifically intended for harvesting comments and use that. The Fediverse is designed specifically to publish its data for others to use in an open manner.

                • GamingChairModel@lemmy.world · +1 · 3 hours ago

                  The Fediverse is designed specifically to publish its data for others to use in an open manner.

                  Sure, and if the AI companies want to configure their crawlers to actually use APIs and ActivityPub to scrape that data efficiently, great. The problem is that there have been crawlers that do things very inefficiently (whether by malice, ignorance, or misconfiguration), scraping the HTML of sites repeatedly, driving up some hosting costs, and effectively DoSing some of the sites.

                  If you put honeypot URLs in the mix, keep out polite bots with robots.txt, and keep out humans by hiding those links, you can serve poisoned responses only at URLs that nobody should be visiting, and not worry too much about collateral damage to legitimate visitors.
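
                  Concretely, a minimal sketch (the trap path and poison upstream are made up for the example):

                    # robots.txt: polite bots never touch the trap
                    User-agent: *
                    Disallow: /trap/

                    # nginx: whoever fetches it anyway gets poison
                    location /trap/ {
                        proxy_pass https://poison-fountain.example/stream/;
                    }

                  The link to /trap/ would be hidden in the page markup so humans never see it; only crawlers that ignore robots.txt ever follow it.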

  • Lembot_0006@programming.dev · +13/-46 · 16 hours ago

    Idiots: This new technology is still quite ineffective. Let’s sabotage its improvement!

    Imbeciles: Yeah!

    • Stern@lemmy.world · +50/-1 · 16 hours ago

      Corpos: Don’t steal our stuff! That’s piracy!

      Also corpos: Your stuff? My stuff now.

      Bootlickers: Oh my god this shoe polish is delicious.

      • FauxLiving@lemmy.world · +2/-2 · 6 hours ago

        Person: Says a thing

        Person 2, who disagrees with the thing: YOU’RE A BOOTLICKER!

        Super convincing. I’m sure you’re going to win people over to your position if you scream loud enough.

      • Lembot_0006@programming.dev · +5/-20 · 15 hours ago

        You should pick a position: either you like the current copyright system or you don’t. You can’t have it both ways.

        • arcterus@piefed.blahaj.zone · +18 · 14 hours ago

          Corporations want the existing copyright system for their own products but simultaneously want to freely scrape data from everyone else.

            • arcterus@piefed.blahaj.zone · +13 · edited · 14 hours ago

              This issue is largely manifesting through AI scraping right now, and many scrapers intentionally ignore robots.txt. Currently, LLM scrapers are basically just bad actors on the internet. Courts in the US have also ruled in favor of a number of AI companies when they were sued, so it’s unlikely anything will change. Effectively, if you don’t like the status quo, stuff like this is one of your few options.

              And that’s not even getting into whether we actually want these companies to improve their models before the problems of energy consumption and potential displacement of human workers are resolved.

              • Lembot_0006@programming.dev · +6/-8 · 13 hours ago

                All crawlers have ignored robots.txt since the very start. Anyway, if THAT is the problem, then IT is the problem, not LLMs as a whole.

                • FauxLiving@lemmy.world · +2/-1 · 6 hours ago

                  You can tell when you’re talking with someone who has been handed the position “AI bad” but doesn’t actually understand the moral positions or technological details that form its foundation, by how confidently they repeat some detail that is clearly nonsense to anybody with knowledge of the subject.

    • Disillusionist@piefed.world (OP) · +35/-2 · 16 hours ago

      AI companies could start by, I don’t know, maybe asking for permission to scrape a website’s data for training? Or maybe trying to behave more ethically in general? Perhaps then they wouldn’t risk people poisoning data they clearly never agreed to have used for training?

      • Lembot_0006@programming.dev · +7/-18 · 15 hours ago

        Why should they ask permission to read freely provided data? Nobody else asks for permission, but LLM trainers somehow should? And what do you want from them, from an ethical standpoint?

        • GunnarGrop@lemmy.ml · +14 · 15 hours ago

          Much of it might be freely available data, but there’s a huge difference between you accessing a website and an LLM crawler doing the same thing. We’ve had bots scraping websites since the ’90s; it’s not a new thing. And for as long as scraping bots have existed, the web has had a standard for dealing with them, called robots.txt: a text file telling bots what they are allowed to do on a website and how they should behave.
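
          For example, a minimal robots.txt might look like this (the rules are invented for illustration; Crawl-delay is a common but nonstandard extension):

            User-agent: GPTBot
            Disallow: /

            User-agent: *
            Crawl-delay: 10
            Disallow: /drafts/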

          LLM crawlers are notorious for disrespecting this, leading to situations where small companies and organisations have their websites scraped so thoroughly and frequently that they can’t even stay online any more, as well as seeing their operational costs skyrocket. In the last few years we’ve had to develop new ways just to protect ourselves against this; see the Anubis project.

          Hence, it’s much more important that LLM crawlers follow the rules than that you and I do so at an individual level.

          It’s the difference between you killing a couple of bees in your home versus an industry specialising in exterminating bees at scale. The efficiency is a big factor.

        • Disillusionist@piefed.world (OP) · +10 · 15 hours ago

          Is the only imaginable system for AI to exist one in which every website operator, or musician, artist, writer, etc has no say in how their data is used? Is it possible to have a more consensual arrangement?

          As for the question about ethics, there’s a lot of ground to cover, and much of it is already being discussed. I’ll basically reiterate the part of what I said that pertains to data rights: I believe they’re pretty fundamental to human rights, for a lot of reasons. AI is killing open source and claiming the whole of human experience for its own training purposes. I find that unethical.

              • Lembot_0006@programming.dev · +3/-7 · 14 hours ago

                As I understand it, the guy is talking about consulting. Yes, LLMs are great for reading documentation; that’s one of their purposes. Now people can use those libraries without spending ages reading through docs. That’s progress. I see it as a way to get more open source written, because writing it has become simpler and less tedious.

                • Disillusionist@piefed.world (OP) · +8 · 14 hours ago

                  He’s jumping ship because it’s destroying his ability to eke out a living. The problem isn’t a small one; what’s happening to him isn’t an isolated case.

        • ExLisper@lemmy.curiana.net · +6 · 14 hours ago

          Yes, they should, because they generate far more traffic. Why do you think people are trying to protect websites from AI crawlers? Because they want to keep public data secret?

          Also, everyone knows AI companies used copyrighted materials and private data without permission. If you think they only used public data you’re uninformed or lying on their behalf.

          • Lembot_0006@programming.dev · +4/-4 · 14 hours ago

            I personally consider the current copyright laws completely messed up, so I see no problem in using any data that’s technically available for processing.

            • ExLisper@lemmy.curiana.net · +8 · 14 hours ago

              OK, so you think it’s fine for big companies to break the laws you don’t like. Cool. I’m sure those big companies won’t sue you when you infringe on some law of theirs that you don’t like.

              And I like the way you just ignored the two other issues I mentioned. Are you fine with AI bots slowing sites like Codeberg to a crawl? Are you fine with AI companies using personal data without consent?

                • ExLisper@lemmy.curiana.net · +4 · 13 hours ago

                  I’m also fine with them using data they can get for free like, I don’t know, weather data they collect themselves?

                  Data hosted by private individuals and open source projects is not free. Someone has to pay for hosting, and AI companies sucking up data with an army of bots drives the cost of hosting beyond the means of those people and projects. They shift the costs of providing the “free” data onto the community while keeping all the profits.

                  Private data used without consent is also not free. It’s valuable, protected data and AI companies are simply stealing it. Do you consider stolen things free?

                  I see your attitude is “they don’t hurt me personally and I don’t care what they do to other people”. It’s either ignorant or outright antisocial. Also a bit bootlickish.

        • BaroqueInMind@piefed.social · +6/-14 · 15 hours ago

          As someone who self-hosts a LLM and trains it on web data regularly to improve my model, I get where your frustration is coming from.

          But engaging in discourse here, where people already have a heavy bias against machine-learning language models, is a fruitless effort. No one here is going to provide you catharsis with a genuine conversation that isn’t rhetoric.

          Just put the keyboard down and walk away.

          • Rekall Incorporated@piefed.social · +6 · 15 hours ago

            I don’t have a bias against LLMs; I use them regularly, albeit either for casual things (movie recommendations) or as an automation tool in work areas where I can fairly easily validate the output or where the specific task is low impact.

            I am just curious: do you respect robots.txt?

          • FaceDeer@fedia.io · +3/-1 · 15 hours ago

            I think it’s worthwhile to show people that views outside of their like-minded bubble exist. One of the nice things about the Fediverse over Reddit is that the upvote and downvote tallies are both shown, so we can see that opinions are not a monolith.

            Also, engaging in Internet debate is never really about convincing the person you’re talking to; that almost never happens. The point of debate is to present convincing arguments to the less-committed casual readers who are lurking rather than participating directly.

            • Disillusionist@piefed.world (OP) · +1 · 15 hours ago

              I agree with you that there can be value in “showing people that views outside of their likeminded bubble[s] exist”. And you can’t change everyone’s mind, but I think it’s a bit cynical to assume you can’t change anyone’s mind.

          • Disillusionist@piefed.world (OP) · +1 · 15 hours ago

            I can’t speak for everyone, but I’m absolutely glad to have good-faith discussions about these things. People have different points of view, and I certainly don’t know everything. It’s one of the reasons I post, for discussion. It’s really unproductive to make blanket statements that try to end discussion before it starts.

            • FauxLiving@lemmy.world · +1 · edited · 6 hours ago

              It’s really unproductive to make blanket statements that try to end discussion before it starts.

              I don’t know, it seems like their comment accurately predicted the response.

              Even if you want to see yourself as some beacon of open and honest discussion, you have to admit that there are a lot of people who are toxic to anybody who mentions any position that isn’t rabidly anti-AI enough for them.