They are testing very straightforward needle retrieval, since LLMs were traditionally terrible at this in longer contexts.
There are some more advanced tests where it's far less impressive. Just a couple of days ago Adobe released one such test: https://github.com/adobe-research/NoLiMa
It's hard to trust their own benchmarks at this point, and I'm not home at the moment, so I can't try it myself either.