They probably asynchronously verify that the IP address actually belongs to goog...

selcuka · 2024-08-20T01:21:03 1724116863

> Synchronously verifying it, would probably be too slow.

Why would it be slow? There is a JSON document that lists all IP ranges on the same page you linked to:

https://developers.google.com/static/search/apis/ipranges/go...
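For illustration, a minimal Python sketch of a synchronous check against a document like that. It assumes the file is an object with a `prefixes` array whose entries carry either an `ipv4Prefix` or an `ipv6Prefix` CIDR string; the sample below uses reserved documentation ranges, not real Googlebot ranges, and `load_networks` / `is_listed` are made-up helper names.

```python
import ipaddress
import json

# Sample shaped like the assumed format; these prefixes are documentation
# ranges (RFC 5737 / RFC 3849), NOT actual Googlebot ranges.
SAMPLE = """
{
  "creationTime": "2024-01-01T00:00:00.000000",
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""

def load_networks(doc: str):
    """Parse the JSON text into a list of ip_network objects."""
    data = json.loads(doc)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def is_listed(ip: str, networks) -> bool:
    """Return True if the address falls inside any listed range."""
    addr = ipaddress.ip_address(ip)
    # Compare only against networks of the same IP version.
    return any(addr.version == net.version and addr in net for net in networks)

networks = load_networks(SAMPLE)
print(is_listed("192.0.2.17", networks))
print(is_listed("198.51.100.1", networks))
```

Once the list is parsed, the check itself is a plain in-process loop with no network round trip per request.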

WatchDog · 2024-08-20T01:40:18 1724118018

Sure, that's one option, and I don't have any insight into what NYT actually does with regard to its handling of Googlebot traffic.

But if I were implementing filtering, I might prefer a solution that doesn't require keeping a whitelist up to date.

Gooblebrai · 2024-08-20T06:09:04 1724134144

The whitelist can be updated asynchronously
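A rough sketch of that idea in Python, assuming a background thread is acceptable: request handling reads an in-memory tuple, and a separate thread swaps in a fresh one periodically. `RangeCache` and the `fetch_cidrs` callable are hypothetical names; how the list is actually downloaded is left to the caller.

```python
import ipaddress
import threading

class RangeCache:
    """Keeps an in-memory list of networks, refreshed off the request path."""

    def __init__(self, fetch_cidrs, interval_seconds=3600.0):
        # fetch_cidrs: hypothetical callable returning a list of CIDR strings.
        self._fetch = fetch_cidrs
        self._interval = interval_seconds
        self._networks = ()
        self._stop = threading.Event()

    def refresh_once(self) -> bool:
        """Fetch and parse a new list; keep the old one if anything fails."""
        try:
            fresh = tuple(ipaddress.ip_network(c) for c in self._fetch())
        except Exception:
            return False
        # Readers see either the old tuple or the new one, never a mix.
        self._networks = fresh
        return True

    def start(self) -> threading.Thread:
        """Run refresh_once in a daemon thread until stop() is called."""
        def loop():
            while not self._stop.is_set():
                self.refresh_once()
                self._stop.wait(self._interval)
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()

    def contains(self, ip: str) -> bool:
        """Synchronous lookup against whatever list is currently loaded."""
        addr = ipaddress.ip_address(ip)
        return any(addr.version == n.version and addr in n for n in self._networks)
```

A failed refresh leaves the last good list in place, so a temporary outage of the source does not empty the whitelist.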

bomewish · 2024-08-20T06:47:48 1724136468

Maybe we could use GCP infra and the trick will work better?

immibis · 2024-08-20T15:10:57 1724166657

Which leads to the possibility of triggering a self-inflicted DoS. I am behind a CGNAT right now. You reckon that if I set myself to Googlebot and loaded NYT, they'd ban the entire o2 mobile network in Germany? (or possibly shared infrastructure with Deutsche Telekom - not sure)

Not to mention the possibility of just filling up the banned IP table.

xp84 · 2024-08-21T03:50:20 1724212220

Hypothetically if they were doing that, they’d only be ‘banning’ that mobile network in the ‘paywall-relaxing-for-Googlebot’ code - not banning the IP traffic or serving a 403 or anything. They ordinarily throw paywalls at those users anyway.

katzgrau · 2024-08-20T00:50:30 1724115030

There are easily installable databases of IP block info, and it’s super easy to do it synchronously, especially if the data is stored in memory. I run a small group of servers that each have to do it thousands of times per second.
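To give a feel for why an in-memory lookup is cheap, here is a hedged Python sketch (not the setup described above): IPv4 ranges are collapsed into sorted integer intervals, and each lookup is one binary search. `build_index` and `contains` are made-up names, and the ranges are documentation prefixes.

```python
import bisect
import ipaddress

def build_index(cidrs):
    """Turn IPv4 CIDR strings into parallel sorted lists of range starts and ends."""
    # collapse_addresses merges adjacent/overlapping networks and yields them sorted.
    nets = list(ipaddress.collapse_addresses(ipaddress.ip_network(c) for c in cidrs))
    starts = [int(n.network_address) for n in nets]
    ends = [int(n.broadcast_address) for n in nets]
    return starts, ends

def contains(index, ip: str) -> bool:
    """Binary-search for the last range starting at or before the address."""
    starts, ends = index
    value = int(ipaddress.IPv4Address(ip))
    i = bisect.bisect_right(starts, value) - 1
    return i >= 0 and value <= ends[i]

index = build_index(["192.0.2.0/25", "192.0.2.128/25", "203.0.113.0/24"])
print(contains(index, "192.0.2.200"))
print(contains(index, "198.51.100.1"))
```

Each lookup costs O(log n) comparisons on plain integers, with no I/O involved.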