> and that LLMs have basically reached a plateau This is the new stochastic parr...

sensanaty · 2025-05-22T22:55:04 1747954504

And we also have a showcase from a day ago [1] of these magical autonomous AI agents failing miserably in the PRs unleashed on the dotnet codebase, where it kept reiterating it fixed tests it wrote that failed without fixing them. Oh, and multiple blatant failures that happened live on stage [2], with the speaker trying to sweep the failures under the rug on some of the simplest code imaginable.

But sure, it managed to find a name buried in some emails after being told to... Search through emails. Wow. Such magic

[1] https://news.ycombinator.com/item?id=44050152 [2] https://news.ycombinator.com/item?id=44056530

hsn915 · 2025-05-22T20:23:11 1747945391

Is this something that the models from 4 months ago were not able to do?

vessenes · 2025-05-23T15:08:56 1748012936

For a fair definition of able, yes. Those models had no ability to engage in a search of emails.

What’s special about it is that it required no handholding; that is new.

camdenreslink · 2025-05-23T15:37:42 1748014662

Is this because the models improved, or the tooling around models improved (both visible and not visible to the end user).

My impression is that the base models have not improved dramatically in the last 6 months and incremental improvements in those models is becoming extremely expensive.

vessenes · 2025-05-26T15:13:38 1748272418

Resist getting your news from Brooklyn journalists. :)

Tooling has improved, and the models have. The combo is pretty powerful.

https://aider.chat/docs/leaderboards/ will give you a flavor of the last six months of improvements. Francois Cholet (ARC AGI: https://arcprize.org/leaderboard) has gone from “No current architecture will ever beat ARC” to “o3 has beaten ARC and now we have designed ARC 2”.

At the same time, we have the first really useful 1mm token context model available with reasonably good skills across the context window (Gemini Pro 2.5), and that opens up a different category of work altogether. Reasoning models got launched to the world in the last six months, another significant dimension of improvement.

TLDR: Massive, massive increase in quality for coding models. And o3 is to my mind over the line people had in mind for generally intelligent in, say, 2018 — o3 alone is a huge improvement launched in the last six months. You can now tell o3 something like: “research the X library and architect a custom extension to that library that interfaces with my weird garage door opener; after writing the architecture implement the extension in (node/python/go) and come back in 20 minutes with something that almost certainly compiles and likely largely interfaces properly, leaving touch-up work to be done.

camdenreslink · 2025-05-27T00:59:17 1748307557

I use LLMs every day, so I get my news from myself (I couldn't even name a Brooklyn journalist?). My experience so far is they are good for greenfield development (i.e. getting a project started), and operating within a well defined scope (e.g. please define this function in this specific place with these constraints).

What I haven't seen is any LLM model consistently being able to fully implement new features or make refactors in a large existing code base (100k+ LOC, which are the code bases that most businesses have). These code bases typically require making changes across multiple layers (front end, API, service/business logic layer, data access layer, and the associated tests, even infrastructure changes). LLMs seem to ignore the conventions of the existing code and try to do their own thing, resulting in a mess.

vessenes · 2025-05-27T17:13:48 1748366028

Def a pain point. Anecdotally Claude code and aider both can be of some help. My go to method is: dump everything in the code base into Gemini and ask for an architecture spec, then ask for implementation from aider or Claude code. This 90% works 80% of the time. Well maybe 90% of the time. Notably it can deal with cross codebase interfaces and data structures, in general, with good prompting.

Dumping it at Claude 3.7 with no instructions will 100% get random rewriting - very annoying.

morepedantic · 2025-05-22T22:13:09 1747951989

The LLMs have reached a plateau. Successive generations will be marginally better.

We're watching innovation move into the use and application of LLMs.

the8472 · 2025-05-23T15:04:49 1748012689

Innovation and better application of a relatively fixed amount of intelligence got us from wood spears to the moon.

So even if the plateau is real (which I doubt given the pace of new releases and things like AlphaEvolve) and we'd only expect small fundamental improvements some "better applications" could still mean a lot of untapped potential.

morepedantic · 2025-06-05T08:01:00 1749110460

The core models have plateaued. MoE and CoT are use of LLMs. Agents are applications of LLMs. It's hard to say how far novel uses and applications will take us, but the fiery explosion at the core has turned into a smolder.

We'll continue to see incremental improvements as training sets, weights, size, and compute improve. But they're incremental.