Great observation. It would be really interesting to repeat this research with less limiting prompts. I assume they imposed the restriction to make the answers easier to parse. Perhaps it could be made two-phase: let the model respond without limits, then follow up with a prompt asking it to compress that into a single answer. I wonder how the results would vary.
> To account for the response variations due to various prompt forms, we created 3 distinct prompt types asking for the solution to the AIW problem: STANDARD, THINKING, and RESTRICTED. The STANDARD prompt type asks to solve the posed problem and output the final answer in the format as described above. This does not put any specific requirements on model behavior. The THINKING prompt type extends STANDARD with the request to think carefully and double check the solution for any mistakes
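Something like this is what I have in mind. A rough sketch using the OpenAI chat API; the AIW wording and the "### Answer" format here are paraphrased from memory, not the paper's exact text, so treat them as placeholders:

```python
# Two-phase prompting sketch: phase 1 lets the model reason freely,
# phase 2 asks it to compress that output into a parseable format.
from openai import OpenAI

client = OpenAI()

# Paraphrase of the AIW problem; not the paper's exact wording.
AIW_PROBLEM = (
    "Alice has 3 brothers and she also has 6 sisters. "
    "How many sisters does Alice's brother have?"
)

def two_phase_answer(problem: str, model: str = "gpt-4o") -> str:
    # Phase 1: unrestricted response, no format constraints at all.
    first = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": problem}],
    )
    reasoning = first.choices[0].message.content

    # Phase 2: follow-up prompt asks only for the final answer,
    # so parsing stays easy without constraining the reasoning.
    second = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": problem},
            {"role": "assistant", "content": reasoning},
            {
                "role": "user",
                "content": 'Compress your answer above into the single '
                           'line: "### Answer: <number>"',
            },
        ],
    )
    return second.choices[0].message.content

print(two_phase_answer(AIW_PROBLEM))
```

You'd still get machine-parseable output, but the restriction would no longer be entangled with the model's actual attempt at the problem.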
The thing is, "LLM reasoning breaks down" didn't surprise me enough to seem worth clicking. Making LLMs fail is not hard. They're interesting for the ways they work, not the (many, many) ways they don't.
edit: I've had a look and I don't think any of their prompts are very good. They're certainly not how I'd write them if I wanted a current model to actually solve the problem.
The way to make me take a paper like this seriously would be to set it up as an adversarial collaboration with a competent prompter, where that person agreed they couldn't write a generic prompt that solves the problem. "We tried three times and none worked" is not news, or at any rate not news about LLMs.