Very much disagree. Current AI benchmarks are fairly arbitrary, as evidenced by the fact that a model can be fitted to a particular benchmark. Even the closest thing to an objective benchmark, "does it answer this question factually?", is just as fallible, because who decides which questions we ask? The same struggles come up when we try to measure human intelligence. The more complex the system, the harder it is to quantify, because there are so many parameters. I could easily contrive some "search engine benchmark", but it wouldn't be very useful, because it would only reflect my own subjective definition of what it means for a search engine to be good.
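To make the "contrived benchmark" point concrete, here's a minimal toy sketch (everything in it is hypothetical, including the "engines"): the score is determined entirely by which queries and "correct" answers the benchmark author hand-picks, so swapping the answer key can flip which engine wins.

```python
# A contrived "search engine benchmark": the score depends entirely on
# which queries and "correct" top results I happen to pick.

def benchmark(engine, cases):
    """Fraction of hand-picked queries whose top result matches my pick."""
    hits = sum(1 for query, expected in cases if engine(query) == expected)
    return hits / len(cases)

# Two toy "engines" that just return a canned top result per query.
engine_a = {"python": "python.org", "rust": "rust-lang.org"}.get
engine_b = {"python": "wikipedia.org/wiki/Python", "rust": "rust-lang.org"}.get

# My subjective answer key: I decided python.org is the right answer.
my_cases = [("python", "python.org"), ("rust", "rust-lang.org")]
print(benchmark(engine_a, my_cases))  # 1.0 -- engine A "wins"
print(benchmark(engine_b, my_cases))  # 0.5

# Someone else prefers Wikipedia as the top result; now engine B "wins".
other_cases = [("python", "wikipedia.org/wiki/Python"), ("rust", "rust-lang.org")]
print(benchmark(engine_b, other_cases))  # 1.0
```

Nothing about the metric changed between the two runs, only the answer key, which is the whole problem: the leaderboard encodes the author's preferences, not some ground truth.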