
"It feels like these new models are no longer making order of magnitude jumps, but are instead into the long tail of incremental improvements. It seems like we might be close to maxing out what the current iteration of LLMs can accomplish and we're into the diminishing returns phase."

SWE-bench went from ~30-40% to ~70-80% this year.

Yet despite this, all the LLMs I've tried struggle to scale much beyond a single module. They may be vast improvements on that benchmark, but in real life they still struggle to stay coherent across larger projects and scales.

> struggle to scale much beyond a single module

Yes. You must guide coding agents at the level of modules and above. In fact, you have to know good coding patterns and make these patterns explicit.

Claude 4 won’t use uv, pytest, pydantic, mypy, classes, small methods, and small files unless you tell it to.

Once you tell it to, it will do a fantastic job generating well-structured, type-checked Python.
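To give a rough idea of what that looks like, here's a minimal sketch (the model, fields, and function are hypothetical; the point is the typed, small-function style you get once the conventions are spelled out):

    from pydantic import BaseModel

    class User(BaseModel):
        # Hypothetical record; fields are made up for illustration.
        id: int
        email: str

    def greeting(user: User) -> str:
        # Small, fully typed function that mypy can check end to end.
        return f"Hello, {user.email}!"

    print(greeting(User(id=1, email="a@example.com")))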


Those are different kinds of issues. Improving the quality of individual actions is what we're seeing here. For larger projects/contexts, the leaders will have to battle it out between improved agents, or actually move to something like RWKV and process the whole project in one go.

They may be different kinds of issues, but they are the issues that actually matter.

3% to 40% is a 13x improvement

40% to 80% is a 2x improvement

It's not that the second leap isn't impressive; it just doesn't change your perspective on reality in the same way.
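Put in concrete numbers (just restating the percentages above), the same jump reads differently depending on whether you count tasks solved or failures remaining:

    # Two framings of a 40% -> 80% benchmark jump
    old, new = 0.40, 0.80
    print(round(new / old, 2))              # 2.0 -> 2x more tasks solved
    print(round((1 - old) / (1 - new), 2))  # 3.0 -> 3x fewer failures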


Maybe... It will be interesting to see how the improvements from here compare on other benchmarks. Is 80%->90% going to be an incremental fix with minimal impact on the next benchmark (the same work, just done better), or an overall 2x improvement on the remaining unsolved cases (a different approach tackling previously missed areas)?

It really depends on how that remaining improvement happens. We'll see soon enough, though: every benchmark nearing 90% is being replaced with something new. SWE-bench Verified is almost dead now.


80% to 100% would be an even smaller improvement, but arguably the most impressive and useful (assuming the benchmark isn't in the training data).

I wouldn’t want to wait ages for Claude Code to fail 60% of the time.

A 20% failure rate seems more manageable, and the improvements speak to better code and problem-solving skills all around.


How much of that is because the models are being optimized specifically for SWE-bench?

Not that much, because it's getting better at all benchmarks.



