Don't look at the absolute numbers; instead, think in terms of relative improvement.

DocVQA is a benchmark with a very strong SOTA. GPT-4 achieves 88.4, Gemini 90.9. That's only a 2.5-point increase, but a ~22% relative error reduction, which is massive for real-life use cases where error tolerance is lower.
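
To spell out the arithmetic (a quick sketch in Python; the scores are the ones quoted above):

    # Relative error reduction: how much of the remaining error went away.
    gpt4, gemini = 88.4, 90.9                        # DocVQA accuracy (%)
    err_gpt4, err_gemini = 100 - gpt4, 100 - gemini  # 11.6 vs 9.1 error
    reduction = (err_gpt4 - err_gemini) / err_gpt4
    print(f"{reduction:.1%}")                        # ~21.6%, i.e. roughly 22%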




This, plus some benchmarks are shitty, so a rational model should be allowed to not answer them and instead ask clarifying questions.


Yes, a lot of those have pretty egregious annotation mistakes. Once you get into the high percentages, it's often worth going through your dataset and comparing your model's predictions against the labels. Obviously you can't do that on academic benchmarks (though some papers still do).
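
A minimal sketch of that kind of audit (the file name and field names here are hypothetical):

    # Flag examples where the model disagrees with the label, then eyeball
    # those first -- a surprising share turn out to be annotation mistakes.
    import json

    with open("dataset_with_predictions.jsonl") as f:  # hypothetical file
        rows = [json.loads(line) for line in f]

    suspects = [r for r in rows if r["prediction"] != r["label"]]
    print(f"{len(suspects)} disagreements to review out of {len(rows)}")
    for r in suspects[:20]:  # review a first batch by hand
        print(r["id"], "| label:", r["label"], "| pred:", r["prediction"])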



