It's closer to <30k before performance degrades too much for 3.5/3.7. 200k/64k i...

jerjerjer · 2025-05-22T18:09:18 1747937358

Is there a benchmark to measure real effective context length?

Sure, gpt-4o has a context window of 128k, but it loses a lot from the beginning/middle.

brookst · 2025-05-22T20:07:42 1747944462

Here's an older study that includes Claude 3.5: https://www.databricks.com/blog/long-context-rag-capabilitie...?

evertedsphere · 2025-05-22T22:24:57 1747952697

RULER: https://arxiv.org/abs/2404.06654

NoLiMa: https://arxiv.org/abs/2502.05167

bigmadshoe · 2025-05-22T19:50:07 1747943407

They often publish "needle in a haystack" benchmarks that look very good, but my subjective experience with a large context is always bad. Maybe we need better benchmarks.
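
For anyone who hasn't seen how a "needle in a haystack" test works, here is a rough sketch in Python. Everything in it is an assumption for illustration: `build_haystack`, `retrieval_grid`, and `toy_model` are made-up names, the filler text is a placeholder, and `toy_model` is a stand-in that only "sees" the last 2000 characters rather than a real model call. It only checks literal retrieval of one planted fact at different depths and lengths, which is a much narrower thing than how a long context gets used in practice.

```python
FILLER = "The quick brown fox jumps over the lazy dog."  # placeholder filler text

def build_haystack(needle: str, n_filler: int, depth: float) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    pos = round(depth * n_filler)
    sentences = [FILLER] * pos + [needle] + [FILLER] * (n_filler - pos)
    return " ".join(sentences)

def retrieval_grid(ask, lengths, depths, needle, question, answer):
    """Return {(length, depth): bool}: did `ask` reproduce `answer` for that prompt?"""
    results = {}
    for n in lengths:
        for d in depths:
            prompt = build_haystack(needle, n, d) + "\n\n" + question
            results[(n, d)] = answer in ask(prompt)
    return results

def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model: it only 'sees' the last 2000 characters."""
    window = prompt[-2000:]
    return "7421" if "7421" in window else "unknown"

grid = retrieval_grid(
    toy_model,
    lengths=[10, 1000],          # number of filler sentences, not tokens
    depths=[0.0, 0.5, 1.0],      # where the needle sits in the context
    needle="The secret code is 7421.",
    question="What is the secret code?",
    answer="7421",
)
for (n, d), ok in sorted(grid.items()):
    print(f"filler={n:5d} depth={d:.1f} found={ok}")
```

To try it against a real model you would swap `toy_model` for a function that sends the prompt to whatever API you use and returns the reply text, and measure length in tokens instead of sentences.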