
Great observation. It would be really interesting to repeat this research with less limiting prompts. I assume they made the restriction to make it easier to parse the answers. Perhaps it could be made two-phase: let the models respond without limits, then ask in a follow-up prompt to compress it to a single answer. I wonder how the results would vary.
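
Something like this is what I mean by two-phase (just a rough sketch using the OpenAI Python client; the model name and the exact AIW wording here are placeholders, not what the paper actually used):

    from openai import OpenAI

    client = OpenAI()

    # Illustrative AIW-style question; the paper varies the numbers.
    question = ("Alice has 3 brothers and she also has 2 sisters. "
                "How many sisters does Alice's brother have?")

    # Phase 1: let the model respond with no restrictions at all.
    first = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": question}],
    )
    free_form = first.choices[0].message.content

    # Phase 2: follow up and ask it to compress that response to a single answer.
    second = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": question},
            {"role": "assistant", "content": free_form},
            {"role": "user", "content": "Now compress that to a single number, nothing else."},
        ],
    )
    print(second.choices[0].message.content)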



From the paper:

> To account for the response variations due to various prompt forms, we created 3 distinct prompt types asking for the solution to the AIW problem: STANDARD, THINKING, and RESTRICTED. The STANDARD prompt type asks to solve the posed problem and output the final answer in the format as described above. This does not put any specific requirements on model behavior. The THINKING prompt type extends STANDARD with the request to think carefully and double check the solution for any mistakes


To be quite honest, I assume they made the restriction so that the models would fail.

This sort of paper is becoming a genre.


In any field, you test models where they fail.

The orbit of Mercury leading to the discovery of GR is an example.

As all models are wrong but some are useful, finding where they fail is how you figure out whether they are useful.

As the 'AGI is near' camp has won the hype game, it is important to ground expectations for practical exploitation of the technology.

Overpromising and unabashed optimism are partly what caused the previous AI winters.

As the formal proof methods of mathematics proved impractical, counterexamples and the scientific method are what CS has used for decades.


They used three different kinds of prompts with varying levels of restrictions, as described in the paper.

To be quite honest, I assume you made your comment so that you could dismiss the paper without reading it.


That's a fair cop; I didn't read it.

The thing is that "LLM reasoning breaks down" simply did not surprise me enough that I thought it was worth clicking. Making LLMs fail is not hard. They're interesting for the ways that they work, not the (many, many) ways that they don't.

edit: I've had a look and I don't think any of their prompts are very good. They're certainly not how I'd write them if I wanted a current model to actually solve the problem.

The way to make me take a paper like this seriously would be to set it up as an adversarial collaboration with a competent prompter, where that person agreed they couldn't make a generic prompt that solved the problem. "We tried three times and none worked" is not news, or at any rate not news about LLMs.


It is proof of a weakness in the current system. This makes sense and gives rise to new hypotheses.


When I accidentally added a " to the end of the prompt, I got a wrong answer.



