
> they spend the same amount of "thinking time" on "what's 2+2?" as they do on complex mathematical proofs.

Not anymore. Have you seen Gemini 2.5 Pro? Ask it simple questions and it almost doesn't "think". Ask it a coding question and it'll write a long reasoning article. I think the same goes for o3.


The original o1 also didn't do this. Neither did the actual DeepSeek R1. You could even get it to answer immediately without any reasoning tokens. These highly distilled versions just lost most of their common sense for this.

Well, it does overthink quite a bit. If it can reduce overthinking, it's going to be useful.

Overthinking is subjective. It really depends on how much you value the answer.

"How long a braking distance does a train need if it's going at 100 km/h?"

Just need a quick reply and you don't care so much (maybe a shower thought)? Or does life or death depend on the answer?

The same question can need different amounts of thinking.


> does life or death depend on the answer?

In this situation I suspect you'd still want the answer quickly.


In this situation you would have someone with actual knowledge of the mechanics involved do the computation using the actual data (e.g., what's the mass of the train? Which kind of brakes does it have?) instead of asking an LLM and trusting it to give the correct answer without checking.

Assuming you could find an expert like that in time, and that they will then be able to understand and solve the problem fast enough to still be helpful.

If you need the answer within a couple of hours, you can probably get it from an expert; if you need an actionable answer within minutes, based on some back-of-the-envelope calculations, then a SOTA LLM is a much safer bet than flagging down whoever seems the smartest in the room and asking them for help.
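For what it's worth, the back-of-the-envelope version is just d = v^2 / (2a). A minimal sketch, assuming a typical service-braking deceleration of about 1 m/s^2 (the real figure depends on the train, its load, and the rails):

  v = 100 / 3.6          # 100 km/h in metres per second (~27.8 m/s)
  a = 1.0                # assumed deceleration in m/s^2, not a real spec
  d = v ** 2 / (2 * a)   # kinematics: distance needed to stop from speed v
  print(round(d), "m")   # roughly 386 m with these assumptions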


Huge assumption. There is a wide range of parameters that go into how accurate a response needs to be, depending on context. Just as there are questions where you need a 100% accurate response regardless of response time, I'm sure there are questions at the other extreme.

Why? Let's say you are designing a railway system. It does not matter if it takes 1 second or an hour if the planning process is months long.

What I really don't like is that I can't manually decide how much thinking Gemini should allocate to a prompt. You're right that sometimes it doesn't think, but for me this also happens on complex queries where I WOULD want it to think. Even things like "super think about this" etc. don't help; it just refuses to.

Gemini 2.5 Pro is getting thinking budgets when it GAs in June (at least that's the promise).

This is available for Flash

Yes, we started with the idea of trying to replicate similar control over the thinking process for open reasoning models. They also announced the Deep Think approach at I/O, which goes even further and combines parallel CoTs at inference.

> I think the same goes for o3.

Definitely, in my experience. Elsewhere in the thread, OP says that open models/systems don't do this, in which case this seems like important work toward making open alternatives competitive.


Is that not just caching? If you have the same query just return the same response.

You could even put a simpler AI in front to decide if it was effectively the same query.
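Something like this, as a minimal sketch (exact match only; the "simpler AI in front" would replace the dictionary lookup with an embedding-similarity check, and llm_call here is just a stand-in for whatever model you're calling):

  # Naive response cache: identical queries never hit the model twice.
  cache = {}

  def answer(query, llm_call):
      key = query.strip().lower()        # crude normalization
      if key in cache:
          return cache[key]              # reuse the earlier response
      cache[key] = llm_call(query)       # otherwise pay for a fresh completion
      return cache[key]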


Has Gemini or OpenAI put out any articles on this or is this just something you noticed?

unrelated note: your blog is nice and I've been following you for a while, but as a quick suggestion: could you make the code blocks (inline or not) highlighted and more visible?

I have syntax highlighting for blocks of Python code - e.g. this one https://simonwillison.net/2025/May/27/llm-tools/#tools-in-th... - is that not visible enough?

This post has an unusually large number of code blocks without syntax highlighting since they're copy-pasted outputs from the debug tool which isn't in any formal syntax.


what are the use cases for llm, the CLI tool? I keep finding tgpt or the built-in AI features of iTerm2 sufficient for quick shell scripting. does llm have any special features that others don't? am I missing something?

I find it extremely useful as a research tool. It can talk to probably over 100 models at this point, providing a single interface to all of them and logging full details of prompts and responses to its SQLite database. This makes it fantastic for recording experiments with different models over time.

The ability to pipe files and other program outputs into an LLM is wildly useful. A few examples:

  llm -f code.py -s 'Add type hints' > code_typed.py
  git diff | llm -s 'write a commit message'
  llm -f https://raw.githubusercontent.com/BenjaminAster/CSS-Minecraft/refs/heads/main/main.css \
    -s 'explain all the tricks used by this CSS'
It can process images too! https://simonwillison.net/2024/Oct/29/llm-multi-modal/

  llm 'describe this photo' -a path/to/photo.jpg
LLM plugins can be a lot of fun. One of my favorites is llm-cmd which adds the ability to do things like this:

  llm install llm-cmd
  llm cmd ffmpeg convert video.mov to mp4
It proposes a command to run, you hit enter to run it. I use it for ffmpeg and similar tools all the time now. https://simonwillison.net/2024/Mar/26/llm-cmd/

I'm getting a whole lot of coding done with LLM now too. Here's how I wrote one of my recent plugins:

  llm -m openai/o3 \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'
I wrote about that one here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/

LLM was also used recently in that "How I used o3 to find CVE-2025-37899, a remote zeroday vulnerability in the Linux kernel’s SMB implementation" story - to help automate running 100s of prompts: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...


Wow, what a great overview; is there a big doc to see all these options? I'd love to try it -- I've been trying the `gh` copilot plugin but this looks more appealing.

I really need to put together a better tutorial - there's a TON of documentation but it's scattered across a bunch of different places:

- The official docs: https://llm.datasette.io/

- The workshop I gave at PyCon a few weeks ago: https://building-with-llms-pycon-2025.readthedocs.io/

- The "New releases of LLM" series on my blog: https://simonwillison.net/series/llm-releases/

- My "llm" tag, which has 195 posts now! https://simonwillison.net/tags/llm/


I use NixOS; seems like this got me enough to get started (I wanted Gemini):

  # AI cli
  (unstable.python3.withPackages (
    ps: with ps; [ llm llm-gemini llm-cmd ]
  ))

looks like most of the plugins are models and most of the functionality you demo'd in the parent comment is baked into the tool itself.

Yeah, a live document might be cool -- part of the interesting bit was seeing the "real" kinds of use cases you use it for.

Anyways will give it a spin.


"LLM was used to find" is not what they did

> had I used o3 to find and fix the original vulnerability I would have, in theory [...]

they ran a scenario that they thought could have led to finding it, which is pretty much not what you said. We don't know how much their foreshadowing crept into their LLM context, and even the article says it was also partly chance. Please be more precise and don't give in to these false beliefs of productivity. Not yet at least.


I said "LLM was also used recently in that..." which is entirely true. They used my LLM CLI tool as part of the work they described in that post.

Very fair. I expect others to confuse what you mean: the productivity of your tool called LLM vs. the doubt that many have about the actual productivity of LLMs, the large language model concept.

I don't use llm, but I have my own "think" tool (with MUCH less support than llm, it just calls openai + some special prompt I have set) and what I use it for is when I need to call an llm from a script.

Most recently I wanted a script that could produce word lists from a dictionary of 180k words given a query, like "is this an animal?" The script breaks the dictionary up into chunks of size N (asking "which of these words is an animal? respond with just the list of words that match, or NONE if none, and nothing else"), makes M parallel "think" queries, and aggregates the results in an output text file.

I had Claude Code do it, and even though I'm _already_ talking to an LLM, it's not a task that I trust an LLM to do without breaking the word list up into much smaller chunks and making loads of requests.
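The shape of the script is roughly this; a minimal sketch, not the real code (the file names, chunk size, and model are made up, and it assumes the openai Python package with an API key configured):

  from concurrent.futures import ThreadPoolExecutor
  from openai import OpenAI

  client = OpenAI()
  CHUNK = 200  # words per request; smaller chunks mean more requests but fewer misses

  def ask(words):
      # One "think" query: send a chunk of the dictionary and parse the matches.
      prompt = ("Which of these words is an animal? Respond with just the "
                "matching words, one per line, or NONE.\n\n" + "\n".join(words))
      resp = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[{"role": "user", "content": prompt}],
      )
      text = resp.choices[0].message.content.strip()
      return [] if text == "NONE" else text.splitlines()

  words = open("dictionary.txt").read().split()
  chunks = [words[i:i + CHUNK] for i in range(0, len(words), CHUNK)]
  with ThreadPoolExecutor(max_workers=8) as pool:     # M parallel queries
      matches = [w for batch in pool.map(ask, chunks) for w in batch]
  open("animals.txt", "w").write("\n".join(sorted(set(matches))))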


you're only a few steps away from creating an LLM synaptic network

I'm automating spending money at an exponential rate.

I never had a good experience with RAG anyway, and it felt "hacky". Not to mention most of it basically died when most models started supporting 1M+ context.

LLMs are already stochastic. I don't want yet another layer of randomness on top.


> and it felt "hacky"

I think the pattern that coined "RAG" is outdated, that pattern being reliance on cosine similarities against an index. It was a stopgap for the 4K-token-window era. For AI copilots, I love Claude Code's and Cline's approach of just following imports and dependencies naturally. Land on a file and let it traverse.

No more crossing your fingers with cosine matching and hoping your reranker didn't drop a critical piece.
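For anyone who hasn't built one, the pattern being criticized is roughly this; a toy sketch (a real pipeline would get vectors from an embedding model and store them in a vector index, rather than the placeholder lists here):

  import numpy as np

  def cosine(a, b):
      # similarity between two embedding vectors
      return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

  def top_k(query_vec, doc_vecs, docs, k=3):
      # rank every stored chunk against the query and keep the best k
      scores = [cosine(query_vec, v) for v in doc_vecs]
      ranked = sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)
      return [doc for _, doc in ranked[:k]]  # these chunks get pasted into the prompt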


>Not to mention most of it basically died when most models started supporting 1M+ context.

Do most models support that much context? I don't think anything close to "most" models support 1M+ context. I'm only aware of Gemini, but I'd love to learn about others.


GPT 4.1 / mini / nano

As the context grows, all LLMs appear to turn into idiots, even just at 32k!

> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.

https://news.ycombinator.com/item?id=44107536


This paper is slightly outdated by LLM model standards -- GPT 4.1 and Gemini 2.5 hadn't been released at that time.

Yes, I mentioned that in the comment in the linked post. I wish someone was running this methodology as an ongoing project, for new models.

Ideally, isn't this a metric that should be included on all model cards? It seems like a crucial metric.


https://gimp-print.sourceforge.io/ which uses CUPS helped me resurrect an old Canon printer for which the company refused to provide updated drivers on macOS.

I was about to throw it in the recycling/trash, but I just couldn't accept that perfectly fine hardware was crippled because the software was not updated to work on the latest macOS versions. Perplexity pointed me to Gutenprint and it worked wonderfully! The only thing that doesn't work is the scanner functionality.


Did you try http://sane-project.org for the scanner part? They have support for some Canons, maybe you're lucky?

I bought VueScan in 2014 specifically for a Canon scanner, looks like it's still around: https://www.hamrick.com/

Many years ago I remember Windows support vanished on a bunch of printers at the 32 to 64 bit transition. That was around the time I learned how printing on Linux and BSD worked, to save a printer or two.

>support vanished on a bunch of printers at the 32 to 64 bit transition

That was after the Win16 to Win32 transition, when every single cutting-edge Sony product I owned, many of them quite expensive (and designed for Win98/WinME because that was cutting edge at the time), stopped working. I've never bought anything Sony since, and no regrets.

Some time later, Sony Pictures wanted something from me and I said, "I boycott Sony" and they said "we're a different company" and I said "change your trademark then, that's the whole point of trademark, reputation"


It was the rootkit on a CDROM that did it for me. Ever since then, I avoid Sony. As you say: reputation.

I liked Mermaid, but unfortunately LLMs don't understand it well, so I switched to LaTeX TikZ, which LLMs know pretty well. At least I know Gemini 2.5 Pro does a good job at TikZ. 3.7 and o1 were meh.

DeepSeek R1 understands Mermaid very well and can correct all the mistakes of Gemini and Claude

I can second this. I’ve been using R1 to both straight up generate mermaid as well as making custom mermaid syntax generators for dynamic diagramming

If it can damage phone cameras, can it also be harmful for human eyes?

answer: back in 2019 it was known to damage both human eyes and camera chips:

https://www.laserfocusworld.com/blogs/article/14040682/safet...


It doesn't sound like there's much danger to the human eye because of the way the beam is scanned, although I certainly wouldn't want to stare into the laser at close range. Camera chips are vulnerable since there's no cornea or vitreous humor to absorb the incident radiation; it's focused directly onto some delicate electronics that don't have the ability to heal minor damage.

I wonder if you have to watch out when getting into a Waymo taxi

yeah, that's a self-fulfilling prophecy.

added.

So, we’ve come full circle to symbolic AI! This article essentially suggests that LLMs could be effective translators of our requests to command-line code or input to symbolic AI software, which would then yield precise solutions. However, I feel this approach is overly mechanical, and I don’t believe AGI would be achieved by creating thousands, if not millions, of MCP servers on our machines. This is especially true because MCP lacks scalability, and anyone who has had to send more than three or four function schemas to a language model knows that excessive JSON schema complexity confuses the model and reduces its performance.
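For context on that last point: each tool is typically described to the model as a JSON Schema blob along the lines of the sketch below (a hypothetical weather tool in the OpenAI-style tools format, written as a Python dict); multiply this by a few dozen MCP-exposed tools and the prompt gets crowded fast.

  get_weather_tool = {
      "type": "function",
      "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {                      # JSON Schema for the arguments
              "type": "object",
              "properties": {
                  "city": {"type": "string", "description": "City name"},
                  "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
              },
              "required": ["city"],
          },
      },
  }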

I'm reminded of what happened in the later years of Cyc. They found their logical framework didn't address certain common problems, so they kept adding specialized hard-coded solutions in Lisp. LLMs are headed for AI autumn.

I think the problem here is we keep making promises we can't keep. It causes us to put too many eggs in one basket, which ironically often prevents us from filling in those gaps. We'd make much more progress without the railroading.

There's only so much money, but come on, we're dumping trillions into highly saturated research directions where several already well-funded organizations have years' worth of a head start. You can't tell me that there's enough money to throw at another dozen OpenAI competitors and another dozen Copilot competitors but we don't have enough for a handful of alternative paradigms that already show promise but will struggle to grow without funding. These are not only much cheaper investments but much less risky than betting on a scrappy startup beating the incumbents at their own game.


The article also suggests that you could use a proof-verifier like Lean instead. Using that capability to generate synthetic data on which to train helps too. Very large context windows have been known to help with programming, and should help with mathematical reasoning too. None of this gives you AGI, I suppose, but the important thing is it makes LLMs more reliable at mathematics.
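For readers who haven't seen Lean, the point is that a statement and its proof are machine-checkable: something as trivial as the illustrative example below either compiles or it doesn't, so an LLM's output can be verified rather than trusted.

  -- Trivial, illustrative Lean 4 theorem (not from the article): the kernel
  -- checks the proof term, so a generated proof cannot be silently wrong.
  theorem my_add_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b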

Anyone have a link to an article exploring Lean plus MCP? EDIT: Here's a recent Arxiv paper: https://arxiv.org/abs/2404.12534v2, the keyword is "neural theorem proving"

I've just remembered: AlphaEvolve showed that LLMs can design their own "learning curricula", to help train themselves to do better at reasoning tasks. I recall these involve the AI suggesting problems that have the right amount of difficulty to be useful to train on.

I'll ramble a tiny bit more: Anybody who learns maths comes to understand that it helps to understand the "guts" of how things work. It helps to see proofs, write proofs, do homework, challenge yourself with puzzles, etc. I wouldn't be surprised if the same thing were true for LLMs. As such, I think having the LLM call out to symbolic solvers could ultimately undermine their intelligence - but using Lean to ensure rigour probably helps.


We've come back full-circle to precise and "narrow interfaces".

Long story short, it is great when humans interact with LLMs for imprecise queries, because we can ascribe meaning to LLM output. But for precise queries, the human, or the LLM, needs to speak a narrow interface to another machine.

Precision requires formalism, as what we mean by precise involves symbolism and operational definition. Where the genius of the human brain lies (and which is not yet captured in LLMs) is in the insight and understanding of what it means to precisely model a world via symbolism, i.e., the place where symbolism originates. As an example, humans operationally and precisely model the shared experience of "space" using the symbolism and theory of Euclidean geometry.

