Ok, well now that you phrase it clearly like that, it makes much more sense: it's a test of being able to maintain a relatively long context length. Another incremental improvement, I suppose.
It's not really a function of maintaining coherency across context length. It's more about whether the model can accomplish a long-time-horizon task when the context length of a single message isn't even close to sufficient for keeping track of all the things that have occurred in pursuit of the task's completion.
Basically, the model has to keep some notes about its overall goals and current progress. Then the context window has to be seeded with the relevant sections from these notes to accomplish subgoals that contribute to the overall goal (beat the game).
The interesting part here is whether the models can even do this. A single context window isn't anywhere near big enough to store everything the model has done in order to drive the next action, so you have to figure out alternate methods and see if the model itself is smart enough to maintain coherency using those methods.
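To make that concrete, here's a minimal sketch of what that kind of note-keeping scaffold could look like. Everything in it is illustrative: `llm_complete` and the `game` interface are hypothetical stand-ins, not the actual harness behind the Pokemon runs.

```python
# Minimal sketch of a note-keeping agent loop. `llm_complete` and `game`
# are hypothetical placeholders, not a real API or the actual harness.
import json
from pathlib import Path

NOTES_PATH = Path("notes.json")  # persistent memory that survives context resets

def load_notes() -> dict:
    if NOTES_PATH.exists():
        return json.loads(NOTES_PATH.read_text())
    return {"overall_goal": "beat the game", "current_subgoal": "", "progress": []}

def save_notes(notes: dict) -> None:
    NOTES_PATH.write_text(json.dumps(notes, indent=2))

def build_prompt(notes: dict, observation: str) -> str:
    # Seed the fresh context window with only the relevant note sections,
    # since the full action history wouldn't come close to fitting.
    recent = "\n".join(notes["progress"][-5:])
    return (
        f"Overall goal: {notes['overall_goal']}\n"
        f"Current subgoal: {notes['current_subgoal']}\n"
        f"Recent progress:\n{recent}\n"
        f"Current screen: {observation}\n"
        "Choose the next action, then write a one-line progress note "
        "and, if needed, an updated subgoal."
    )

def step(llm_complete, game) -> None:
    notes = load_notes()
    reply = llm_complete(build_prompt(notes, game.observe()))  # hypothetical call
    game.act(reply["action"])
    # The model's own summary becomes the durable memory for future steps.
    notes["progress"].append(reply["progress_note"])
    if reply.get("new_subgoal"):
        notes["current_subgoal"] = reply["new_subgoal"]
    save_notes(notes)
```

Even in a toy like this the tension is visible: only a slice of the notes fits in any one prompt, so the model's own summaries have to carry everything that matters, and that's exactly where coherence tends to break down.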
Pokemon is interesting because it's a test of whether these models can solve long-time-horizon tasks.
That's it.