Everyone in the open source LLM community knows the standard benchmarks are all but worthless.
Cheating seems to be rampant, and by cheating I mean training on the test questions and answers. Sometimes it's intentional, sometimes accidental. There are some good papers on checking for contamination, but hardly anyone bothers to spend the compute to actually run the checks.
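For what it's worth, the kind of check those papers describe isn't exotic. Here's a rough sketch of an n-gram overlap test, assuming ARC-Challenge as the benchmark and a hypothetical local JSONL training shard with a "text" field; the n-gram length is arbitrary:

```python
# Rough sketch of an n-gram overlap contamination check.
# The training shard path and the "text" field are hypothetical.
from datasets import load_dataset

N = 8  # n-gram length; the exact value varies paper to paper

def ngrams(text: str, n: int = N) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

# Collect n-grams from the benchmark questions once.
bench = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
bench_ngrams = set()
for row in bench:
    bench_ngrams |= ngrams(row["question"])

# Scan a shard of the training corpus and flag overlapping documents.
corpus = load_dataset("json", data_files="pretrain_shard.jsonl", split="train")
flagged = [i for i, doc in enumerate(corpus) if ngrams(doc["text"]) & bench_ngrams]
print(f"{len(flagged)} of {len(corpus)} documents share an {N}-gram with ARC-Challenge")
```

Real decontamination pipelines normalize the text and scale this up with hashing, but even a crude version like this catches verbatim copies.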
This goes double for LLMs hidden behind APIs: you have no idea what Google or OpenAI are doing on their end, and you can't audit them the way you can an open model with raw weights. You also don't know what their testing conditions are. Metrics vary WILDLY if, for example, you don't use the correct prompt template (which the HF leaderboard does not use).
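To make the prompt-template point concrete, here's a sketch of one multiple-choice item formatted two different ways; the model name is just an example of an instruct-tuned checkpoint that ships a chat template:

```python
# The same ARC-style item as a raw continuation prompt (what a base-model
# harness typically scores) vs. wrapped in the model's own chat template.
# Multiple-choice benchmarks score the log-likelihood of the answer tokens,
# and that number can shift a lot between these two framings.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

question = "Which factor will most likely cause a person to develop a fever?"
answer = "a bacterial population in the bloodstream"

# 1) Raw harness-style prompt:
raw_prompt = f"Question: {question}\nAnswer: {answer}"

# 2) The same question routed through the model's chat template:
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

print(raw_prompt)
print(chat_prompt)
```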
...Also, many test sets (HellaSwag, for example) are full of errors and ambiguous items anyway. It's not hidden; you can find them just by randomly sampling the tests.
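You can check this yourself in a few lines (field names per the HF copy of HellaSwag):

```python
# Pull a random handful of HellaSwag validation items and print them with the
# labeled ending starred, so you can eyeball the errors/ambiguity by hand.
import random
from datasets import load_dataset

ds = load_dataset("hellaswag", split="validation")
for i in random.sample(range(len(ds)), 5):
    row = ds[i]
    print(row["ctx"])
    for j, ending in enumerate(row["endings"]):
        marker = "*" if str(j) == row["label"] else " "
        print(f"  {marker} {ending}")
    print()
```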
Then it's not really a benchmark? Model trainers and researchers aren't continuously testing; they dump a result and move on.
The answer is a standardized "secret" closed-source test, performed in a controlled environment.
I know, I don't like the sound of it either, but in this case I think closed source plus a single overseeing entity is by far the best solution. Facebook already built something like this, but they only went halfway: they published the questions and kept only the answers secret.
Interestingly, the College Board might be the best entity to do this.
Colleges are apparently moving away from standardized tests, so why not point that machinery at AI?
It's exactly what we need: novel questions with minimal reuse, created and curated by an independent team of experts, and designed to assess general intelligence across multiple dimensions.
The trick is to keep the answers to the test data with an authority that only reports your score, like Kaggle does, and then allow only a single submission per new model to avoid leakage. I find it a bit sad that this practice has fallen by the wayside, since it went pretty mainstream in the research community with the Netflix Prize back in 2009.
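As a minimal sketch of that protocol (all names and the in-memory store are purely illustrative):

```python
# Minimal sketch of the hidden-answer-key protocol: the overseeing entity
# keeps the labels private, accepts one prediction set per model, and only
# ever returns an aggregate score -- never per-question feedback, which could
# be used to reconstruct the key over repeated submissions.

class SecretBenchmark:
    def __init__(self, answer_key: dict[str, str]):
        self._key = answer_key        # never leaves this object
        self._submitted = set()       # enforce one submission per model

    def submit(self, model_name: str, predictions: dict[str, str]) -> float:
        if model_name in self._submitted:
            raise ValueError("only one submission per model")
        self._submitted.add(model_name)
        correct = sum(predictions.get(q) == gold for q, gold in self._key.items())
        return correct / len(self._key)

# The evaluator holds the key; submitters only ever see their score.
evaluator = SecretBenchmark({"q1": "B", "q2": "D", "q3": "A"})
print(evaluator.submit("shiny-new-7b", {"q1": "B", "q2": "A", "q3": "A"}))  # ~0.667
```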
Coming back to the cheating, as a random example: the top LLM on the Open LLM Leaderboard right now has an outrageous ARC score. It's something like 20 points higher than the next models down, which I also suspect of contamination: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
But who cares? Just let the VC money pour in.