At the end of the day, I fully expect large-n Hanoi and all these things to end up as yet another solved benchmark, like the needle-in-a-haystack or spelling tests that people used to show shortcomings of LLMs, which turned out to be technical implementation artefacts (context limits, tokenization) and got fixed pretty fast once that kind of problem was integrated into training. LLMs will always have to take a slightly different approach to reasoning than humans because of these technical constraints, but that doesn't mean they are fundamentally inferior. It only means we can't rely on human training data forever and have to look more towards things like RL.
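For context on why large-n Hanoi is a plausible artefact rather than a reasoning wall: the optimal solution requires 2^n - 1 moves, so the sheer output length explodes with n and quickly exceeds any fixed generation budget. A minimal sketch (my own illustration, not from any benchmark's code):

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Classic recursion: park n-1 disks on aux, move the largest, restack."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, src, dst, aux, moves)
        moves.append((src, dst))
        hanoi(n - 1, aux, src, dst, moves)
    return moves

# Optimal solution length is 2^n - 1; listing every move grows exponentially.
for n in (3, 10, 20):
    print(n, len(hanoi(n)))  # 7, 1023, 1048575 moves respectively
```

Nothing about the puzzle's logic gets harder as n grows; only the transcript does, which is exactly the kind of technical bottleneck that tends to get engineered away.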
People also like to forget that from the dawn of modern computing and AI research some 60 years ago all the way up to about 7 years ago, the best models in the world could barely string together a few coherent sentences. If LLMs are this century's transistor, we are barely past the era of building-sized computers still searching for everyday applications.