> they spend the same amount of "thinking time" on "what's 2+2?" as they do on complex mathematical proofs.
Not anymore. Have you seen Gemini 2.5 Pro? Ask it simple questions and it almost doesn't "think". Ask it a coding question and it'll write a long reasoning article. I think the same goes for o3.
The original o1 also didn't do this, and neither did the actual DeepSeek R1; you could even get it to answer immediately, without any reasoning tokens. It's the highly distilled versions that lost most of that common sense.
In this situation you would have someone with actual knowledge of the mechanics involved do the computation using the actual data (e.g., what's the mass of the train? Which kind of brakes does it have?) instead of asking an LLM and trusting it to give the correct answer without checking.
Assuming you could find an expert like that in time, and that they would then be able to understand and solve the problem fast enough to still be helpful.
If you need the answer within a couple of hours, you can probably get it from an expert; if you need an actionable answer within minutes, based on some back-of-the-envelope calculations, then a SOTA LLM is a much safer bet than flagging down whoever seems the smartest in the room and asking them for help.
Huge assumption. There is a wide range of parameters that go into how accurate you need a response to be, depending on context. Just as surely as there exist questions where you need a 100% accurate response regardless of response time, I'm sure there exist questions at the other extreme.
What I really don't like is that I can't manually decide how much thinking Gemini should allocate to a prompt. You're right that sometimes it doesn't think, but for me this also happens on complex queries where I WOULD want it to think. Even things like "super think about this" etc. don't help; it just refuses to.
Yes, we started with the idea of trying to replicate similar control over thinking processes for open reasoning models. They also announced the Deep Think approach at I/O, which goes even further and combines parallel CoTs at inference.
Definitely, in my experience. Elsewhere in the thread, OP says that open models/systems don't do this, in which case this seems like important work toward making open alternatives competitive.
unrelated note: your blog is nice and I've been following you for a while, but as a quick suggestion: could you make the code blocks (inline or not) highlighted and more visible?
This post has an unusually large number of code blocks without syntax highlighting since they're copy-pasted outputs from the debug tool which isn't in any formal syntax.
what are the use cases for llm, the CLI tool? I keep finding tgpt or the built-in AI features of iTerm2 sufficient for quick shell scripting. does llm have any special features that others don't? am I missing something?
I find it extremely useful as a research tool. It can talk to probably over 100 models at this point, providing a single interface to all of them and logging full details of prompts and responses to its SQLite database. This makes it fantastic for recording experiments with different models over time.
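Since everything lands in SQLite, you can also poke at your prompt history from a script. A quick sketch in Python (`llm logs path` prints where the database lives; the sketch lists the tables rather than assuming any particular schema):

    # Sketch: inspect llm's prompt/response log database directly.
    import sqlite3
    import subprocess

    # Ask llm where its logs.db lives rather than hard-coding a path.
    db_path = subprocess.run(
        ["llm", "logs", "path"], capture_output=True, text=True, check=True
    ).stdout.strip()

    conn = sqlite3.connect(db_path)
    # List the tables instead of guessing the schema.
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    )]
    print(tables)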
The ability to pipe files and other program outputs into an LLM is wildly useful. A few examples:
    llm -f code.py -s 'Add type hints' > code_typed.py

    git diff | llm -s 'write a commit message'

    llm -f https://raw.githubusercontent.com/BenjaminAster/CSS-Minecraft/refs/heads/main/main.css \
      -s 'explain all the tricks used by this CSS'
I'm getting a whole lot of coding done with LLM now too. Here's how I wrote one of my recent plugins:
    llm -m openai/o3 \
      -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
      -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
      -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
    number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'
LLM was also used recently in that "How I used o3 to find CVE-2025-37899, a remote zeroday vulnerability in the Linux kernel’s SMB implementation" story - to help automate running 100s of prompts: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...
Wow what a great overview; is there a big doc to see all these options?
I'd love to try it -- I've been trying the `gh` copilot plugin but this looks more appealing.
> had I used o3 to find and fix the original vulnerability I would have, in theory [...]
they ran a scenario that they thought could have led to finding it, which is pretty much not what you said. We don't know how much their foreshadowing crept into the LLM context, and even the article says it was also partly chance. Please be more precise and don't give in to these false beliefs of productivity. Not yet, at least.
Very fair. I expect others to confuse what you mean: the productivity of your tool called LLM vs. the doubt many have about the actual productivity of LLMs, the large language model concept.
I don't use llm, but I have my own "think" tool (with MUCH less support than llm; it just calls OpenAI plus some special prompt I have set), and what I use it for is calling an LLM from a script.
Most recently I wanted a script that could produce word lists from a dictionary of 180k words given a query, like "is this an animal?" The script breaks the dictionary up into chunks of size N (asking "which of these words is an animal? respond with just the list of words that match, or NONE if none, and nothing else"), makes M parallel "think" queries, and aggregates the results in an output text file.
I had Claude Code do it, and even though I'm _already_ talking to an LLM, it's not a task that I trust an LLM to do without breaking the word list up into much smaller chunks and making loads of requests.
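For what it's worth, the shape of the script is roughly the sketch below; the model name, chunk size, and prompt wording are placeholders, and it uses the plain openai Python package rather than my actual "think" wrapper:

    # Sketch of the chunk-and-aggregate word-list approach described above.
    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    client = OpenAI()
    CHUNK_SIZE = 500   # N: words per request (placeholder)
    MAX_WORKERS = 8    # M: parallel requests (placeholder)

    def classify_chunk(words, query):
        # One request: ask which of these words match the query.
        prompt = (
            f"{query} Respond with just the list of words that match, "
            "one per line, or NONE if none, and nothing else.\n\n" + "\n".join(words)
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content.strip()
        return [] if text == "NONE" else text.splitlines()

    def word_list(dictionary, query):
        chunks = [dictionary[i:i + CHUNK_SIZE]
                  for i in range(0, len(dictionary), CHUNK_SIZE)]
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
            results = pool.map(lambda c: classify_chunk(c, query), chunks)
        # Aggregate, dedupe, and sort the matches from every chunk.
        return sorted({w.strip() for r in results for w in r if w.strip()})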
I never had good experience with RAG anyway, and it felt "hacky". Not to mention most of it basically died when most models started supporting +1M context.
LLMs are already stochastic. I don't want yet another layer of randomness on top.
I think the pattern the term "RAG" was coined for is outdated, that pattern being reliance on cosine similarities against an index. It was a stopgap for the 4K-token-window era. For AI copilots, I love the approach Claude Code and Cline take of just following imports and dependencies naturally. Land on a file and let it traverse.
No more crossing your fingers on cosine matching and hoping your reranker didn't drop a critical piece.
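A toy version of that "land on a file and traverse" idea, just to make it concrete (Python-only, and it assumes imports resolve to sibling .py files; real tools handle packages, relative imports, and other languages):

    # Toy illustration of "follow the imports" context gathering.
    import ast
    from pathlib import Path

    def local_imports(path: Path) -> list[Path]:
        """Return sibling .py files that this file imports."""
        tree = ast.parse(path.read_text())
        names = []
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names += [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.append(node.module)
        return [p for n in names
                if (p := path.parent / f"{n.split('.')[0]}.py").exists()]

    def gather_context(entry: Path, budget: int = 10) -> list[Path]:
        """Breadth-first walk of the import graph, capped at `budget` files."""
        seen, queue = [], [entry]
        while queue and len(seen) < budget:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue += local_imports(current)
        return seen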
>Not to mention most of it basically died when most models started supporting +1M context.
Do most models support that much context? I don't think anything close to "most" models support 1M+ context. I'm only aware of Gemini, but I'd love to learn about others.
As the context grows, all LLMs appear to turn into idiots, even just at 32k!
> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
https://gimp-print.sourceforge.io/ which uses CUPS helped me resurrect an old Canon printer for which the company refused to provide updated drivers on macOS.
I was about to throw it in the recycling/trash, but I just couldn't accept that perfectly fine hardware was crippled because the software was not updated to work on the latest macOS versions. Perplexity pointed me to Gutenprint and it worked wonderfully! The only thing that doesn't work is the scanner functionality.
Many years ago I remember Windows support vanished on a bunch of printers at the 32 to 64 bit transition. That was around the time I learned how printing on Linux and BSD worked, to save a printer or two.
>support vanished on a bunch of printers at the 32 to 64 bit transition
that was after the win16 to win32 transition when every single cutting edge tech Sony product I owned, many of them quite expensive (and designed for win98/winME because that was cutting edge), stopped working. I've never bought anything Sony since, and no regrets.
Some time later, Sony Pictures wanted something from me and I said, "I boycott Sony" and they said "we're a different company" and I said "change your trademark then, that's the whole point of trademark, reputation"
I liked Mermaid but unfortunately LLMs don't understand it well, so I switched to LaTeX TikZ, which LLMs know pretty well. At least I know Gemini 2.5 Pro does a good job with TikZ; 3.7 and o1 were meh.
It doesn't sound like there's much danger to the human eye because of the way the beam is scanned, although I certainly wouldn't like to stare into the laser at close range. Camera chips are vulnerable since there's no cornea or vitreous humor to absorb the incident radiation; it's focused directly onto delicate electronics that don't have the ability to heal minor damage.
So, we’ve come full circle to symbolic AI! This article essentially suggests that LLMs could be effective translators of our requests to command-line code or input to symbolic AI software, which would then yield precise solutions. However, I feel this approach is overly mechanical, and I don’t believe AGI would be achieved by creating thousands, if not millions, of MCP servers on our machines. This is especially true because MCP lacks scalability, and anyone who has had to send more than three or four function schemas to a language model knows that excessive JSON schema complexity confuses the model and reduces its performance.
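For reference, each tool you expose means shipping something like the following into the model's context (a generic sketch of an MCP-style tool definition, not any particular server's); multiply it by dozens of tools and the schema text alone starts to crowd out everything else:

    # Generic sketch of one MCP-style tool definition (hypothetical tool).
    # Field names follow the MCP convention of name/description/inputSchema.
    search_tool = {
        "name": "search_tickets",
        "description": "Search support tickets by keyword and date range.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Keywords to match"},
                "after": {"type": "string", "format": "date"},
                "before": {"type": "string", "format": "date"},
                "limit": {"type": "integer", "minimum": 1, "maximum": 100},
            },
            "required": ["query"],
        },
    }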
I'm reminded of what happened in the later years of Cyc. They found their logical framework didn't address certain common problems, so they kept adding specialized hard-coded solutions in Lisp. LLMs are headed for AI autumn.
I think the problem here is we keep making promises we can't keep. It causes us to put too many eggs in one basket, which ironically often prevents us from filling in those gaps. We'd make much more progress without the railroading.
There's only so much money, but come on: we're dumping trillions into highly saturated research directions where several already well-funded organizations have years' worth of a head start. You can't tell me there's enough money to throw at another dozen OpenAI competitors and another dozen Copilot competitors, but not enough for a handful of alternative paradigms that already show promise and will struggle to grow without funding. These are not only much cheaper investments but much less risky than betting on a scrappy startup being the top dog at its own game.
The article also suggests that you could use a proof-verifier like Lean instead. Using that capability to generate synthetic data on which to train helps too. Very large context windows have been known to help with programming, and should help with mathematical reasoning too. None of this gives you AGI, I suppose, but the important thing is it makes LLMs more reliable at mathematics.
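To make the proof-verifier point concrete: the appeal of Lean is that the signal is binary, a generated proof either type-checks or it doesn't, so there's something hard to train or filter against. Two trivial Lean 4 examples (mine, not from the article):

    -- Trivial Lean 4 statements: the checker either accepts the proof or rejects it.
    theorem two_plus_two : 2 + 2 = 4 := by decide

    theorem add_comm_nat (a b : Nat) : a + b = b + a := Nat.add_comm a b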
Anyone have a link to an article exploring Lean plus MCP? EDIT: Here's a recent arXiv paper: https://arxiv.org/abs/2404.12534v2; the keyword is "neural theorem proving".
I've just remembered: AlphaEvolve showed that LLMs can design their own "learning curricula", to help train themselves to do better at reasoning tasks. I recall these involve the AI suggesting problems that have the right amount of difficulty to be useful to train on.
I'll ramble a tiny bit more: Anybody who learns maths comes to understand that it helps to understand the "guts" of how things work. It helps to see proofs, write proofs, do homework, challenge yourself with puzzles, etc. I wouldn't be surprised if the same thing were true for LLMs. As such, I think having the LLM call out to symbolic solvers could ultimately undermine their intelligence - but using Lean to ensure rigour probably helps.
We've come back full-circle to precise and "narrow interfaces".
Long story short, it is great when humans interact with LLMs for imprecise queries, because we can ascribe meaning to LLM output. But for precise queries, the human or the LLM needs to speak a narrow interface to another machine.
Precision requires formalism, as what we mean by precise involves symbolism and operational definition. Where the genius of the human brain lies (and which is not yet captured in LLMs) is the insight and understanding of what it means to precisely model a world via symbolism, i.e., the place where symbolism originates. As an example, humans operationally and precisely model the shared experience of "space" using the symbolism and theory of Euclidean geometry.