> Multiple IPs makes some sense since requests are being made through random proxies, and I don't think on its own demonstrates intention of bad behavior.

Agreed, just using multiple IPs isn't malicious on its own. I thought it was notable in the context of issuing another request with a generic browser UA. It's possible that the IP change was a deliberate strategy to avoid detection (as changing the UA likely is), but it's also possible that it was just a side effect of their infrastructure design, or some combination of the two.

> without even acknowledging, or presenting a viewpoint on, the concern many have with the potential malicious use of these tools.

So, amusingly, they seem to have added an "ethical scraping" page[1] to their docs between when I looked at this a few hours ago and now. (You can see that this page is missing from the sidebar in my archive link from earlier.) I particularly enjoy the parts where they say "follow robots.txt rules" and "limit RPS on one site", because as far as I can tell it is actually not possible to do either of these things as a user of this service. There is no mechanism (that I could find) to set an identifiable user agent on the scraper client, nor a mechanism to control the delay between crawling different pages. It's not impossible that they have implemented a reasonable rate limit, with proper backoff when the target appears to be under load, but I wouldn't bet on it.
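For reference, the missing knobs amount to maybe fifteen lines of client code. A minimal sketch with plain Python stdlib, nothing to do with their API; the bot name, URL, and delay are made up for illustration:

    import time
    import urllib.request
    from urllib.robotparser import RobotFileParser

    # Hypothetical bot identity; the point is that it is identifiable.
    UA = "examplebot/0.1 (+https://example.com/bot)"

    robots = RobotFileParser("https://example.com/robots.txt")
    robots.read()

    def polite_fetch(url, delay=2.0):
        """Fetch url only if robots.txt allows it, then wait before the next hit."""
        if not robots.can_fetch(UA, url):
            return None  # disallowed: skip, don't retry with a browser UA
        req = urllib.request.Request(url, headers={"User-Agent": UA})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        time.sleep(delay)  # crude per-host rate limit
        return body

If a service exposes neither the UA header nor the inter-request delay, "follow robots.txt" and "limit RPS" are advice the user has no way to act on.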

> Is the actual IP range even within the large Google IP range given the source of the proxies as mentioned elsewhere in this thread?

Good question! I am not able to test this, because they don't expose the proxies without paying them money, which I do not intend to do. My guess would be "no".

> still much to discuss in the broader context.

Yeah. The broader context is that most LLM scrapers are a plague, and I cannot wait for this bubble to pop. Until recently, we were getting upwards of 20k requests per day from individual LLM scrapers on a gitlab instance that I admin. The combination of malice and staggering incompetence with which these are operated is incredible. I have observed fun tactics like "switch to a generic UA and increase the crawling rate after being added to robots.txt" from the same scraper that isn't smart enough to realize that it doesn't need to crawl the same commit hash multiple times per hour. The bit that tells you not to get stuck crawling the CI pipeline results forever is there to protect you, silly.
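For the curious, the protective rules in question look something like this (paths are illustrative; GitLab's actual defaults vary by version):

    User-agent: *
    Disallow: /*/-/pipelines
    Disallow: /*/-/jobs
    Disallow: /*/-/commit

Pipeline and job pages are effectively unbounded: every run generates more of them, and crawling them buys you nothing but load.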

Things are reportedly much worse[2] for admins of larger services. I saw this referred to as a "DDOS of the entire internet" a while ago, which is pretty accurate.

What we ended up doing is setting up an infinite maze of markov chain nonsense text that we serve to LLM scrapers at a few bytes per second. All they have to do to avoid it is respect robots.txt. I recommend this! It's fun and effective, and if we're lucky, it might cause harm to some people and systems that deserve it.
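If you want to build one, the core is embarrassingly small. A toy sketch with a standalone http.server and a random-word generator standing in for a real markov chain; in practice you'd have your main server route only the robots.txt-ignoring UAs here:

    import random
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    WORDS = "the a scraper page link text model chain crawl".split()

    def babble(n):
        """Yield nonsense one word at a time, with an occasional link deeper in."""
        for i in range(n):
            yield random.choice(WORDS) + " "
            if i % 40 == 39:
                yield '<a href="/maze/%d">more</a> ' % random.randrange(10**9)

    class Tarpit(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            for chunk in babble(5000):
                try:
                    self.wfile.write(chunk.encode())
                    self.wfile.flush()
                except BrokenPipeError:
                    return  # scraper gave up; victory
                time.sleep(1.0)  # a few bytes per second, as advertised

    HTTPServer(("", 8080), Tarpit).serve_forever()

The slow drip is the important part: each connection the scraper holds open costs it a worker while costing you almost nothing.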

[1]: https://web.archive.org/web/20250322072210/https://docs.hype...

[2]: https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/




> The broader context is that most LLM scrapers are a plague, and I cannot wait for this bubble to pop.

My thing is, there are legitimate uses for automated browsing, uses that could be extremely useful and yet nondamaging to (or even supported by!) site operators. But we'll never get to have them if the tools and methods used to implement them are the same ones used by people inadvertently DDoSing the sites whose entire contents they're trying to inhale. For them not to get lumped together, purveyors of the tools cannot remain "neutral", hide implementation details, or condone bad behavior, whether explicitly or implicitly. We've seen this happen on the web before, and we're already seeing desperate organizations deploy nuclear-option LLM scraper blockers that also take out things like RSS readers. Anyways... I may just have to write something about this...

> So, amusingly, they seem to have added an "ethical scraping" page[1] to their docs between when I looked at this a few hours ago and now.

Congrats, you made an impact :)



