Have those results been reproduced elsewhere with other benchmarks than what Goog... | Hacker News


diggan 12 months ago | on: Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Compa...

Have those results been reproduced elsewhere, with benchmarks other than the ones Google seems to use? It's hard to trust their own benchmarks at this point, and I'm not home at the moment so I can't try it myself either.

llm_nerd 12 months ago

They are testing very straightforward needle retrieval, which LLMs were traditionally terrible at in longer contexts.

There are some more advanced tests where it's far less impressive. Just a couple of days ago Adobe released one such test: https://github.com/adobe-research/NoLiMa
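
For anyone unfamiliar with the term, a straightforward needle retrieval test of the kind mentioned above can be sketched roughly as follows. This is only a minimal illustration, not the harness Google or Adobe actually use: the filler sentences, the needle, the question, and the helper names `build_haystack` and `score_answer` are all made up for the example, and the call to a model is deliberately left out.

```python
import random

def build_haystack(filler_sentences, needle, depth, n_sentences, seed=0):
    """Build a long context and insert `needle` at relative position `depth` (0.0 to 1.0)."""
    rng = random.Random(seed)  # fixed seed so the haystack is reproducible
    sentences = [rng.choice(filler_sentences) for _ in range(n_sentences)]
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

def score_answer(answer, expected):
    """Pass if the expected string appears anywhere in the answer (case-insensitive)."""
    return expected.lower() in answer.lower()

# Hypothetical test data, invented for this sketch.
FILLER = [
    "The weather was mild that afternoon.",
    "Several trains were delayed by a signal fault.",
    "The committee postponed its decision until spring.",
]
NEEDLE = "The secret passphrase is blue-heron-42."
QUESTION = "What is the secret passphrase?"

context = build_haystack(FILLER, NEEDLE, depth=0.5, n_sentences=1000)
prompt = f"{context}\n\nQuestion: {QUESTION}"

# `prompt` would be sent to the model under test; its reply is then checked with
# score_answer(reply, "blue-heron-42"), repeated over many depths and context lengths.
```

Note that the question here shares its key words with the needle, so the model can succeed by matching text literally, which is part of why this kind of test counts as the easy case.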
