Hacker News new | past | comments | ask | show | jobs | submit login

They probably asynchronously verify that the IP address actually belongs to googlebot, then ban the IP when it fails.

Synchronously verifying it, would probably be too slow.

You can verify googlebot authenticity by doing a reverse dns lookup, then checking that reverse dns name resolves correctly to the expected IP address[0].

[0]: https://developers.google.com/search/docs/crawling-indexing/...




> Synchronously verifying it, would probably be too slow.

Why would it be slow? There is a JSON documenbt that lists all IP ranges on the same page you linked to:

https://developers.google.com/static/search/apis/ipranges/go...


Sure, that's one option, and I don't have any insight into what nyt actually does with regards to it's handling of googlebot traffic.

But if I were implementing filtering, I might prefer a solution that doesn't require keeping a whitelist up to date.


The whitelist can be updated asynchronously


Maybe we could use GCP infra and the trick will work better ?


Which leads to the possibility of triggering a self-inflicted DoS. I am behind a CGNAT right now. You reckon that if I set myself to Googlebot and loaded NYT, they'd ban the entire o2 mobile network in Germany? (or possibly shared infrastructure with Deutsche Telekom - not sure)

Not to mention the possibility of just filling up the banned IP table.


Hypothetically if they were doing that, they’d only be ‘banning’ that mobile network in the ‘paywall-relaxing-for-Googlebot’ code - not banning the IP traffic or serving a 403 or anything. They ordinarily throw paywalls at those users anyway.


There are easily installable databases of IP block info, super easy to do it synchronously, especially if it’s stored in memory. I run a small group of servers that each have to do it thousands of times per second.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: