
Why are "AI" bots generating so much fuss. Is it because there are so many of them? Is it because AI companies are each writing their own (bad) crawlers instead of using existing ones?


AI bot operators are financially incentivized not to be good citizens: they want as much data as possible, as fast as possible, and don't care who they piss off in the process. Plus, for now at least, they have effectively unlimited money to throw at bandwidth, storage, IP addresses, crawling with full-blown headless browsers, etc.


And it gets worse.

For now they are probably paying for residential IP addresses from services that sell them (and those services get them from people who willingly sell some of their bandwidth for pennies).

But I think it won't be long before we start seeing each AI company running its own swarm of residential IP addresses by selling a browser extension or mobile app themselves, with a pitch like:

"Get faster results (or a discount) by using our extension! By using your own internet connection to fetch the required context, you won't need to share computing resources with other users, thusly increasing the speed of your queries! Plus, since you don't use our servers, that means we can pass our savings to you as a discount!"

Then, in small print, they'd say they use your connection to help other users with their queries, or that it's more eco-friendly because of sharing, or whatever else they come up with to justify it.


I thought this as well while reading the last discussion. I believe some extra-shady free VPNs have used a browser extension to borrow your endpoint to work around geoblocks, etc. I always thought this was a terrible idea: who wants their home internet IP associated with some random VPN user's traffic? A voracious, mindless bot that slurps up everything it can reach isn't much better.

Microsoft could even build this into Windows; they already use your upload bandwidth to help distribute their updates.


OpenAI has ‘already’ got a browser extension. Who knows when this is ‘enabled’? We already had the Honey debacle with Amazon/eBay referral-link stealing.


> OpenAI has ‘already’ got a browser extension. Who knows when this is ‘enabled’?

You could test this theory pretty easily by monitoring traffic... (I haven't, but maybe someone has?)
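
One low-effort way to check: point the browser at a local proxy and log every host it contacts while you leave it idle with the extension installed. A rough sketch as a mitmproxy addon (the proxy setup and the "watch for fetches you didn't initiate" approach are my own assumptions, not anything OpenAI documents):

```python
# watch_hosts.py -- run with: mitmproxy -s watch_hosts.py
# Prints each new host the browser contacts; leave the browser idle with the extension
# installed and see whether requests show up that you never initiated.
from mitmproxy import http

seen_hosts: set[str] = set()

def request(flow: http.HTTPFlow) -> None:
    host = flow.request.pretty_host
    if host not in seen_hosts:
        seen_hosts.add(host)
        print(f"new host: {host}  ({flow.request.method} {flow.request.pretty_url})")
```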


They are maxing out the CPU of web servers. For example, Anthropic hitting a server 11 times per second, non-stop, easily loads a basic web server serving a dynamic website. That's roughly 1 million page views per day. And they keep it up for weeks, even though they could have scraped whatever they are after in less than an hour.
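
(Back of the envelope: 11 requests/second × 86,400 seconds/day ≈ 950,000 requests/day, which is where that roughly-a-million figure comes from.)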


I personally don't care about the intended use of the crawling -- and I also don't know that the bots we are seeing now are "AI bots"; I would not have used that phrase.

What many of us have seen is a huge increase in bot crawling traffic, from highly distributed IPs, often requesting insane combinations of query params that don't actually get them useful content -- and that brings down our sites. (And they increase their volume if you scale up your resources!) They seem to have very deep pockets, in that they don't mind scraping terabytes of useless/duplicate content from me (they could get all the actually useful open content from my sitemap, and I wouldn't mind!)

That's what bothers me. I don't care if they scrape my site for AI purposes in polite robots.txt-respecting honest-user-agent low-volume ways. And if they are doing it the way they are doing it for something other than AI, it's just as much of a problem. (The best guess is just that it's for AI).
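
For concreteness, "polite robots.txt-respecting honest-user-agent low-volume" just means something like the sketch below -- a minimal illustration using only Python's standard-library robots.txt parser, where the bot name, URLs, and fallback delay are made up:

```python
# polite_crawl.py -- minimal sketch of a robots.txt-respecting, honest-user-agent,
# low-volume crawler. Bot name, URLs, and the fallback delay are illustrative only.
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleResearchBot/0.1 (+https://example.org/bot-info)"  # honest, contactable UA

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

delay = rp.crawl_delay(USER_AGENT) or 10  # conservative pause if the site doesn't specify one

for url in ("https://example.com/catalog", "https://example.com/catalog?page=2"):
    if not rp.can_fetch(USER_AGENT, url):
        continue  # the site asked us not to fetch this path; skip it
    # ... fetch `url` here, sending the honest User-Agent header ...
    time.sleep(delay)  # low volume: one request per crawl delay, not eleven per second
```

The bots described in this thread do roughly the opposite: randomized user agents, no crawl delay, and no regard for robots.txt.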

So I agree with you that I wouldn't have spoken of this in terms of "AI".

But it has become a huge problem.

"Fighting the AI scraperbot scourge" https://lwn.net/Articles/1008897/

"LLM crawlers continue to DDoS SourceHut" https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/

"Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries" https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-domi...

Some of us -- we think jokingly -- wonder if Cloudflare or other WAF purveyors are behind it. It is leaving most of us no choice but some kind of WAF or bot detection.


This is explained in the article. Tl;dr: for whatever reason, these AI bots behave nothing like the web crawlers of old. To quote TFA:

> The current generation of bots is mindless. They use as many connections as you have room for. If you add capacity, they just ramp up their requests. They use randomly generated user-agent strings. They come from large blocks of IP addresses. They get trapped in endless hallways. I observed one bot asking for 200,000 nofollow redirect links pointing at Onedrive, Google Drive and Dropbox. (which of course didn't work, but Onedrive decided to stop serving our Canadian human users). They use up server resources - one speaker at Code4lib described a bug where software they were running was using 32 bit integers for session identifiers, and it ran out!
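
(For scale: a 32-bit integer tops out at roughly 2-4 billion values depending on signedness, so "running out" of session identifiers gives some idea of the request volume involved.)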


Although I think it's likely that these are "AI" bots, the real problem is the proliferation of rich and crappy crawlers. Whether or not legacy crawlers respect robots.txt, etc., they do seem to be sophisticated enough to determine when they're stuck in a loop. The home organizations of these new crawlers seem to have more money than sense and are often getting stuck in large dynamic sites for months without retrieving any new information. Among all of the articles about building "bot traps," libraries realized that they have unwittingly been in the bot trapping business for years.
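
For what it's worth, the loop detection that better-behaved crawlers do isn't rocket science. A rough illustrative sketch -- the normalization rules and threshold are made up, not from any real crawler:

```python
# Loop-guard sketch: normalize URLs (drop volatile query params, sort the rest) and hash
# page bodies; stop descending once you keep seeing content you've already fetched.
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

VOLATILE_PARAMS = {"sort", "sessionid", "utm_source"}  # example params to ignore

def canonical(url: str) -> str:
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in VOLATILE_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(sorted(query)), ""))

class LoopGuard:
    def __init__(self, max_duplicates: int = 50):
        self.seen_urls: set[str] = set()
        self.seen_bodies: set[str] = set()
        self.duplicate_streak = 0
        self.max_duplicates = max_duplicates

    def should_fetch(self, url: str) -> bool:
        return canonical(url) not in self.seen_urls

    def record(self, url: str, body: bytes) -> None:
        self.seen_urls.add(canonical(url))
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.seen_bodies:
            self.duplicate_streak += 1
        else:
            self.duplicate_streak = 0
            self.seen_bodies.add(digest)

    def stuck_in_loop(self) -> bool:
        # A long run of already-seen bodies is a strong hint we're walking a calendar,
        # faceted search, or some other effectively infinite dynamic structure.
        return self.duplicate_streak >= self.max_duplicates
```

The exact logic doesn't matter; the point is that a crawler with any self-awareness stops when it keeps getting the same bytes back under different URLs, and the current crop apparently doesn't bother.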


Seems like the latter. There are basically a large number of well-funded attempts to crawl the internet, and enough of them are badly behaved that it amounts to a DDoS against smaller hosts.



