> set the user agent header to the googlebot one

Also, how effective is this really? Don't the big news sites check the IP addresses of clients claiming to be Googlebot?
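
For anyone unfamiliar, the trick is nothing more than sending Googlebot's published user agent string with the request. A minimal sketch in Python (the URL is a placeholder, not a real paywalled article):

    import urllib.request

    req = urllib.request.Request(
        "https://example.com/some-paywalled-article",  # hypothetical URL
        headers={
            # Googlebot's published desktop user agent string
            "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                          "+http://www.google.com/bot.html)",
        },
    )
    html = urllib.request.urlopen(req).read()

Sites that only sniff the header will serve the full article, hence the question about IP checks.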




This. 12ft has never ever worked for me.


I know one website it works well on, so I still use it, but yes, most others fail.


If you hosted that server on Google Cloud, you'd already make it a lot harder to detect.


https://developers.google.com/search/docs/crawling-indexing/...

They provide ways to verify Googlebot IPs specifically; anyone who cares to check wouldn't be fooled by a fake Googlebot running on Google's cloud.

Likewise with Bingbot: https://www.bing.com/webmasters/help/how-to-verify-bingbot-3...
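
For the curious, the documented check is a reverse DNS lookup followed by a confirming forward lookup. Roughly, as a sketch (not production code):

    import socket

    def is_real_googlebot(ip):
        # 1) Reverse lookup: genuine Googlebot IPs resolve to a host
        #    under googlebot.com or google.com,
        #    e.g. crawl-66-249-66-1.googlebot.com
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # 2) Forward lookup of that host must map back to the same IP,
        #    so an impostor can't fake it with their own PTR record
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False

Google also publishes Googlebot's IP ranges as JSON these days, so a site can match against a static list instead of doing DNS lookups per request.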


Yes, where "cares" means "the lost revenue is greater than the cost of development, QA, computational/network/storage overhead, and the added complexity of a function that figures out whether people are faking their user agent."

That cost is probably orders of magnitude greater than the revenue lost to the tiny minority of people doing such things, especially since not everyone who uses tools like these would become a subscriber if blocked, which cuts the "lost" revenue down even further.


Even if it's not worth an actual site operator's time to implement such a system themselves, WAFs like Cloudflare could easily check the IP address of clients claiming to be Googlebot/Bingbot and send them to CAPTCHA hell on the site's behalf if they're lying. That's pretty low-hanging fruit for a WAF; I would be surprised if they don't do it.

edit: Indeed, I just tried curling cloudflare.com with Googlebot's user agent and they immediately gave me the finger (403) on the very first request.


I sincerely hope the antitrust suit ends this practice soon. This is so obviously anticompetitive.


How?

I also think the antitrust suit (and many more like it) need to happen for more obvious things like buying out competitors. However, how does publishing a list of valid IP addresses for their web crawlers constitute anticompetitive behavior? Anyone can publish a similar list, and any company can choose to reference those lists.


It allows Google to access data that is denied to competitors. It’s a clear example of Google using its market power to suppress competition.


Hmm, the robots.txt, IP blocking, and user agent blocking are all policies chosen by the web server hosting the data. If web admins choose to block Google competitors, I'm not sure that's on Google. Can you clarify?
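
To illustrate, a robots.txt like this (a hypothetical file, not any specific site's) admits Google's crawler and shuts out everyone else, without Google doing anything:

    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /

Crawlers match the most specific group, so Googlebot gets the Allow and every other bot gets the blanket Disallow.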


A nice example is the recent Reddit-Google deal, which gives Google's crawler exclusive access to Reddit's data. This just serves to give Google a competitive advantage over other search engines.


Well yes, the Reddit-Google deal might be found to violate antitrust. Probably will, because it is so blatantly anticompetitive. But if a publication decides to give special access to search engines so they can enforce their paywall but still be findable by search, I don't think the regulators would worry about that, provided that there's a way for competing search engines to get the same access.


Which is it? Regulators shouldn’t worry, or we need regulations to ensure equal access to the market?


Regulators wouldn't worry if all search engines had equal access, even if you didn't, because you're not a search engine.


And if I had wheels, I would be a car. There's no equal access without regulation.


Nope. That deal was for AI, not search.


This is false; the deal cuts all other search engines off from accessing Reddit. Go to Bing, search for "news site:reddit.com", and filter results by date to the past week: 0 results.

https://www.404media.co/google-is-the-only-search-engine-tha...


Antitrust kicks in exactly in cases like this: using your moat in one market (search) to win another market (AI).


What do you think search is?


It's not anticompetitive behavior by Google for a website to restrict its content.

Whether by IP, user account, user agent, whatever.


It kind of is. If Google divested search and the new company provided utility style access to that data feed, I would agree with you. Webmasters allow a limited number of crawlers based on who had market share in a specific window of time, which serves to lock in the dominance of a small number of competitors.

It may not be the kind of explicit anticompetitive behavior we normally see, but it needs to be regulated on the same grounds.


Google's action is to declare its identity.

The website operator can do with that identity as they wish.

They could block it, accept it, or accept it only on Tuesday afternoons.

---

"Anticompetitive" would be some action by Google to suppress competitors. Offering identification is not that.


Regardless of whether Google has broken the law, the arrangement is clearly anticompetitive. It is not dissimilar to owning the telephone or power wires 100 years ago: building owners were not willing to install redundant wiring for each provider of the same service, and webmasters are not willing to allow unlimited numbers of crawlers on their sites. If we continue to believe in competitive and robust markets, we can't allow a monopolistic corporation to act as a private regulator of a key service that powers the modern economy.

The law may need more time to catch up, but search indexing will eventually be made a utility.


Google is paying the website to restrict their content.


In a specific case (Reddit) yes.

And that has an argument.

But in the general case no.


Which sites that allow Google to index their content block Bing and other search engines (as opposed to bots scraping for other purposes)?


If you can prove a deal between Google and the website, then you may have a case. Otherwise it is difficult to prove anything.


It's clearly meant to starve out competitors. Why else would they want website operators to definitively "know" whether an IP is Googlebot's, other than so they can differentiate it and treat it differently?

It's all under the guise of feel-good language like "make sure it's Google, and not some abusive scraper." But the end result is pretty clear. Just because they have a parallel construction of a valid reason for doing something doesn't mean they don't enjoy the convenient benefits it brings.



