Yes, where "cares" means "the lost revenue is greater than the cost of development, QA, computational/network/storage overhead, and added complexity of a function that figures out whether people are faking their user agent."
That cost is probably orders of magnitude greater than the revenue lost to the tiny minority of people doing such things, especially since not everyone who uses tools like these would become a subscriber if blocked, which cuts the "lost" revenue down even further.
Even if it's not worth an actual site operator's time to implement such a system themselves, WAFs like Cloudflare could easily check the IP addresses of clients claiming to be Googlebot/Bingbot and send them to CAPTCHA Hell on the site's behalf if they're lying. That's pretty low-hanging fruit for a WAF; I would be surprised if they don't do that.
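For the curious, Google's documented way to verify a claimed Googlebot is a reverse-DNS lookup followed by a forward-DNS confirmation. A minimal sketch of that check in Python (the example IPs are illustrative and may resolve differently over time):

    import socket

    def is_real_googlebot(ip: str) -> bool:
        """Reverse-DNS the IP, require a googlebot.com/google.com hostname,
        then forward-resolve that hostname and confirm it maps back to the
        same IP (this is the verification procedure Google documents)."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)        # reverse DNS
        except (socket.herror, socket.gaierror):
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            return socket.gethostbyname(host) == ip      # forward confirmation
        except socket.gaierror:
            return False

    print(is_real_googlebot("66.249.66.1"))   # inside Google's crawler range; expected True
    print(is_real_googlebot("203.0.113.5"))   # TEST-NET address; expected False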
edit: Indeed, I just tried curling cloudflare.com with Googlebot's user agent and they immediately gave me the finger (403) on the very first request.
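If you want to reproduce that test yourself, here is the same request sketched in Python rather than curl (the exact response will depend on Cloudflare's rules whenever you run it):

    import requests

    # Spoof Googlebot's user agent string on a plain request to cloudflare.com.
    GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    resp = requests.get("https://www.cloudflare.com/",
                        headers={"User-Agent": GOOGLEBOT_UA},
                        timeout=10)
    print(resp.status_code)   # 403 at the time the comment above was written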
I also think the antitrust suit needs to happen (along with many more) for more obvious things like buying out competitors. However, how does publishing a list of valid IPs for their web crawlers constitute anticompetitive behavior? Anyone can publish a similar list, and any company can choose to reference those lists.
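Concretely, Google publishes its crawler ranges as a JSON file of CIDR prefixes, and consuming such a list is trivial for any operator. A rough sketch, assuming the commonly cited URL and the prefixes/ipv4Prefix/ipv6Prefix schema (both worth double-checking before relying on them):

    import ipaddress
    import json
    import urllib.request

    # Assumed location and schema of Google's published Googlebot ranges.
    RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    def load_networks(url: str):
        """Download the range list and parse each CIDR prefix into a network object."""
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        prefixes = (e.get("ipv4Prefix") or e.get("ipv6Prefix")
                    for e in data.get("prefixes", []))
        return [ipaddress.ip_network(p) for p in prefixes if p]

    def claim_checks_out(client_ip: str, networks) -> bool:
        """True if the client's IP falls inside any published Googlebot range."""
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in networks)

    networks = load_networks(RANGES_URL)
    print(claim_checks_out("66.249.66.1", networks))   # expected True
    print(claim_checks_out("203.0.113.5", networks))   # expected False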
Hmm, robots.txt rules, IP blocking, and user-agent blocking are all policies chosen by whoever operates the web server hosting the data. If web admins choose to block Google's competitors, I'm not sure that's on Google. Can you clarify?
A nice example is the recent Reddit-Google deal, which gives Google's crawler exclusive access to Reddit's data. This just serves to give Google a competitive advantage over other search engines.
Well yes, the Reddit-Google deal might be found to violate antitrust. Probably will, because it is so blatantly anticompetitive. But if a publication decides to give special access to search engines so they can enforce their paywall but still be findable by search, I don't think the regulators would worry about that, provided that there's a way for competing search engines to get the same access.
This is false; the deal cuts all other search engines off from accessing Reddit. Go to Bing, search for "news site:reddit.com", and filter results by date to the past week: 0 results.
It kind of is. If Google divested search and the new company provided utility-style access to that data feed, I would agree with you. Webmasters allow a limited number of crawlers based on who happened to have market share in a specific window of time, which serves to lock in the dominance of a small number of competitors.
It may not be the kind of explicit anticompetitive behavior we normally see, but it needs to be regulated on the same grounds.
Regardless of whether Google has broken the law, the arrangement is clearly anticompetitive. It is not dissimilar to owning the telephone or power wires 100 years ago. Building operators were not willing to install redundant connections for the same service from each provider, and webmasters are not willing to allow unlimited numbers of crawlers on their sites. If we continue to believe in competitive and robust markets, we can't allow a monopolistic corporation to act as a private regulator of a key service that powers the modern economy.
The law may need more time to catch up, but search indexing will eventually be made a utility.
It's clearly meant to starve out competitors. Why else would they want website operators to definitively "know" whether an IP belongs to Googlebot, other than so that they can differentiate it and treat it differently?
It's all under the guise of feel-good language like "make sure it's Google, and not some abusive scraper." But the end result is pretty clear. Just because they have a parallel construction of a valid reason for doing something doesn't mean they don't enjoy the convenient benefits it brings.
Also, how effective is this really? Don't the big news sites already check the IP addresses of clients claiming to be Googlebot?