Great observation. It would be really interesting to repeat this research with less limiting prompts. I assume they imposed the restriction to make the answers easier to parse. Perhaps it could be made two-phase: let the model respond without limits, then follow up with a prompt asking it to compress that into a single answer. I wonder how the results would vary.
> To account for the response variations due to various prompt forms, we created 3 distinct prompt types asking for the solution to the AIW problem: STANDARD, THINKING, and RESTRICTED. The STANDARD prompt type asks to solve the posed problem and output the final answer in the format as described above. This does not put any specific requirements on model behavior. The THINKING prompt type extends STANDARD with the request to think carefully and double check the solution for any mistakes
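Something like this is what I have in mind. A rough sketch using the OpenAI chat API; the AIW wording and the "### Answer" format here are paraphrased from memory, not the paper's exact text, so treat them as placeholders:

```python
# Two-phase prompting sketch: phase 1 lets the model reason freely,
# phase 2 asks it to compress that output into a parseable format.
from openai import OpenAI

client = OpenAI()

# Paraphrase of the AIW problem; not the paper's exact wording.
AIW_PROBLEM = (
    "Alice has 3 brothers and she also has 6 sisters. "
    "How many sisters does Alice's brother have?"
)

def two_phase_answer(problem: str, model: str = "gpt-4o") -> str:
    # Phase 1: unrestricted response, no format constraints at all.
    first = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": problem}],
    )
    reasoning = first.choices[0].message.content

    # Phase 2: follow-up prompt asks only for the final answer,
    # so parsing stays easy without constraining the reasoning.
    second = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": problem},
            {"role": "assistant", "content": reasoning},
            {
                "role": "user",
                "content": 'Compress your answer above into the single '
                           'line: "### Answer: <number>"',
            },
        ],
    )
    return second.choices[0].message.content

print(two_phase_answer(AIW_PROBLEM))
```

You'd still get machine-parseable output, but the restriction would no longer be entangled with the model's actual attempt at the problem.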
The thing is, "LLM reasoning breaks down" didn't surprise me enough to seem worth clicking. Making LLMs fail is not hard. They're interesting for the ways they work, not the (many, many) ways they don't.
edit: I've had a look and I don't think any of their prompts are very good. They're certainly not how I'd write them if I wanted a current model to actually solve the problem.
The way to make me take a paper like this seriously would be to set it up as an adversarial collaboration with a competent prompter, where that person agreed they couldn't write a generic prompt that solves the problem. "We tried three times and none worked" is not news, or at any rate not news about LLMs.