"It feels like these new models are no longer making order of magnitude jumps, but are instead into the long tail of incremental improvements. It seems like we might be close to maxing out what the current iteration of LLMs can accomplish and we're into the diminishing returns phase."
Yet despite all this, every LLM I've tried struggles to scale much beyond a single module. They're vast improvements on that benchmark, perhaps, but in real life they still struggle to stay coherent over larger projects and scales.
Those are different kinds of issues. Improving the quality of individual actions is what we're seeing here. Then for larger projects/contexts, the leaders will have to battle it out between improved agents, or actually move to something like RWKV and process the whole project in one go.
Maybe... It will be interesting to see how the improvements here compare to other benchmarks. Is 80% -> 90% going to be an incremental fix with minimal impact on the next benchmark (the same work, done better), or is it going to be an overall 2x improvement on the remaining unsolved cases (a different approach tackling previously missed areas)?
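A minimal sketch of the arithmetic behind that "2x" reading, assuming benchmark scores are simple pass rates and using only the numbers already mentioned in this thread:

    # Two ways to read a jump in benchmark pass rate.
    def error_rate_reduction(old_pass: float, new_pass: float) -> float:
        """Factor by which the unsolved (failing) fraction shrinks."""
        return (1.0 - old_pass) / (1.0 - new_pass)

    # 80% -> 90% is a 10-point gain on the leaderboard, but it halves
    # the unsolved cases: 20% failing -> 10% failing.
    print(error_rate_reduction(0.80, 0.90))  # 2.0

    # Midpoints of the SWE-bench range quoted below (~35% -> ~75%):
    print(error_rate_reduction(0.35, 0.75))  # 2.6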
It really depends on how that remaining improvement happens. We'll see soon enough, though - every benchmark nearing 90% is being replaced with something new. SWE-bench Verified is almost dead now.
SWE-bench went from ~30-40% to ~70-80% this year.