> set the user agent header to the googlebot one

Also, how effective is this really? Don't the big news sites check the IP addresses of clients claiming to be Googlebot?
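
For anyone unfamiliar, the trick is nothing more than sending Googlebot's published user agent string with the request. A minimal sketch in Python (the URL is a placeholder, not a real paywalled article):

    import urllib.request

    req = urllib.request.Request(
        "https://example.com/some-paywalled-article",  # hypothetical URL
        headers={
            # Googlebot's published desktop user agent string
            "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                          "+http://www.google.com/bot.html)",
        },
    )
    html = urllib.request.urlopen(req).read()

Sites that only sniff the header will serve the full article, hence the question about IP checks.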




This. 12ft has never ever worked for me.


I know one website it works well on, so I still use it, but yes, most others fail.


If you hosted that server on Google Cloud, you'd already make it a lot harder to detect.


https://developers.google.com/search/docs/crawling-indexing/...

They provide ways to verify Googlebot IPs specifically; anyone who cares to check wouldn't be fooled by a fake Googlebot running on Google's cloud.

Likewise with Bingbot: https://www.bing.com/webmasters/help/how-to-verify-bingbot-3...
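
For the curious, the documented check is a reverse DNS lookup followed by a confirming forward lookup. Roughly, as a sketch (not production code):

    import socket

    def is_real_googlebot(ip):
        # 1) Reverse lookup: genuine Googlebot IPs resolve to a host
        #    under googlebot.com or google.com,
        #    e.g. crawl-66-249-66-1.googlebot.com
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # 2) Forward lookup of that host must map back to the same IP,
        #    so an impostor can't fake it with their own PTR record
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False

Google also publishes Googlebot's IP ranges as JSON these days, so a site can match against a static list instead of doing DNS lookups per request.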


Yes, where "cares" means "the lost revenue is greater than the cost of development, QA, computational/network/storage overhead, and the added complexity of a function that figures out whether people are faking their user agent."

That cost is probably orders of magnitude greater than the revenue lost to the tiny minority of people doing such things, especially since not everyone who uses tools like these would become a subscriber if blocked, which cuts the "lost" revenue down even further.


Even if it's not worth an actual site operator's time to implement such a system themselves, WAFs like Cloudflare could easily check the IP address of clients claiming to be Googlebot/Bingbot and send them to CAPTCHA hell on the site's behalf if they're lying. That's pretty low-hanging fruit for a WAF; I would be surprised if they don't do it.

edit: Indeed, I just tried curling cloudflare.com with Googlebot's user agent and they immediately gave me the finger (403) on the very first request.


I sincerely hope the antitrust suit ends this practice soon. This is so obviously anticompetitive.


How?

I also think the antitrust suit (and many more like it) need to happen for more obvious things like buying out competitors. However, how does publishing a list of valid IP addresses for their web crawlers constitute anticompetitive behavior? Anyone can publish a similar list, and any company can choose to reference those lists.


It allows Google to access data that is denied to competitors. It’s a clear example of Google using its market power to suppress competition.


Hmm, the robots.txt, IP blocking, and user agent blocking are all policies chosen by the web server hosting the data. If web admins choose to block Google competitors, I'm not sure that's on Google. Can you clarify?
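
To illustrate, a robots.txt like this (a hypothetical file, not any specific site's) admits Google's crawler and shuts out everyone else, without Google doing anything:

    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /

Crawlers match the most specific group, so Googlebot gets the Allow and every other bot gets the blanket Disallow.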


A nice example is the recent Reddit-Google deal, which gives Google's crawler exclusive access to Reddit's data. This just serves to give Google a competitive advantage over other search engines.


Well yes, the Reddit-Google deal might be found to violate antitrust. Probably will, because it is so blatantly anticompetitive. But if a publication decides to give special access to search engines so they can enforce their paywall but still be findable by search, I don't think the regulators would worry about that, provided that there's a way for competing search engines to get the same access.


Which is it? Regulators shouldn’t worry, or we need regulations to ensure equal access to the market?


Regulators wouldn't worry if all search engines had equal access, even if you didn't, because you're not a search engine.


And if I had wheels, I would be a car. There's no equal access without regulation.


Nope. That deal was for AI, not search.


This is false; the deal cuts all other search engines off from accessing Reddit. Go to Bing, search for "news site:reddit.com", and filter results by date to the past week: 0 results.

https://www.404media.co/google-is-the-only-search-engine-tha...


Antitrust kicks in exactly in cases like this: using your moat in one market (search) to win another market (AI).


What do you think search is?


It's not anticompetitive behavior by Google for a website to restrict its content.

Whether by IP, user account, user agent, whatever.


It kind of is. If Google divested search and the new company provided utility style access to that data feed, I would agree with you. Webmasters allow a limited number of crawlers based on who had market share in a specific window of time, which serves to lock in the dominance of a small number of competitors.

It may not be the kind of explicit anticompetitive behavior we normally see, but it needs to be regulated on the same grounds.


Google's action is to declare its identity.

The website operator can do with that identity as they wish.

They could block it, accept it, or accept it only on Tuesday afternoons.

---

"Anticompetitive" would be some action by Google to suppress competitors. Offering identification is not that.


Regardless of whether Google has broken the law, the arrangement is clearly anticompetitive. It is not dissimilar to owning the telephone or power wires 100 years ago: building owners were not willing to install redundant wiring for each provider of the same service, and webmasters are not willing to allow unlimited numbers of crawlers on their sites. If we continue to believe in competitive and robust markets, we can't allow a monopolistic corporation to act as a private regulator of a key service that powers the modern economy.

The law may need more time to catch up, but search indexing will eventually be made a utility.


Google is paying the website to restrict their content.


In a specific case (Reddit) yes.

And that has an argument.

But in the general case no.


Which sites that allow Google to index their content block Bing and other search engines (as opposed to bots scraping for other purposes)?


If you can prove a deal between Google and the website, then you may have a case. Otherwise it is difficult to prove anything.


It's clearly meant to starve out competitors. Why else would they want website operators to definitively "know" whether an IP is Googlebot's, other than so they can differentiate it and treat it differently?

It's all under the guise of feel-good language like "make sure it's Google, and not some abusive scraper." But the end result is pretty clear. Just because they have a parallel construction of a valid reason for doing something doesn't mean they don't enjoy the convenient benefits it brings.



