• auraithx@piefed.social
    1 day ago

    Yeah, this will have absolutely no impact on gathering training data.

    I assumed it was to block AI agents crawling it during requests, which they’d be unlikely to bypass in the web UI.

    But no company spending millions on training will hesitate to have an agent appear as a regular desktop user to scrape data.

    • boonhet@sopuli.xyz
      1 day ago

      Does Cloudflare still look at the agent? I thought they had more reliable data points.

      • auraithx@piefed.social
        1 day ago

        I meant an AI agent, not the browser's user agent. All data points can be spoofed, and if they can’t, they’ll pay a human to scrape before they pay for content.

        • boonhet@sopuli.xyz
          23 hours ago

          Okay, fair enough, I thought you meant just the user agent. The trouble with having a bot make it look like an actual user is viewing the data is that it’s slow and inefficient. The trouble with paying humans to scrape the data is that it’s slow and inefficient. These companies want to ingest data ridiculously fast because there’s so much of it. If all else fails, they’ll resort to paying the content creators, but only if it’s data they really think gives their model a competitive edge in some metric and they can’t pirate it. E.g., I can see them paying for scientific research they can’t get from libgen, but not for some rando’s blog post or a local news website.