Not only is "Harry Potter problem" a misnomer, the "shortcomings" of LLMs being investigated here don't feel novel. We already know counting is a major weakness of LLMs, so why muddy the waters if you wanted to drill down on issues related to long-context recall?
Perhaps my biggest gripe: if you're going to lure readers in with an interesting name like the "Harry Potter problem", it better be either technically interesting or entertaining.
Do most readers know that if you give a so-called million-token context model that many tokens, it will actually stop paying attention after the first ~30k tokens? And that if they tried to use this product for anything serious, they would run into hallucinations and incompleteness that could have material implications?
Not everything needs to be entertaining to be useful.
The point is that this isn't even really useful, because it's not a minimal reproduction of the problem they're actually interested in.
LLMs are bad at counting no matter what size of context they're given. If you're going to formulate a thought experiment to illustrate how an LLM stops paying attention well before the context limit, it should use a task that LLMs are known to be good at at smaller context sizes. Otherwise you might be entertaining, but you're also misleading.
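If you want to isolate the attention/recall issue, something like this rough sketch (the filler text, the "needle" fact, and the model name are all illustrative, not from the article) would do it without dragging counting into the picture: bury one fact the model can trivially quote back at small scale, then grow the haystack and see where recall falls off.

```python
# Sketch of a long-context recall test with no counting involved.
# Assumes the OpenAI Python SDK v1.x and an OPENAI_API_KEY in the
# environment; swap in whatever client and model you actually use.

from openai import OpenAI

client = OpenAI()

FILLER = "The sky over the manor was grey and unremarkable that morning. "
NEEDLE = "The policy's vault access code is 4172. "
QUESTION = "What is the vault access code mentioned in the text above? Answer with the number only."


def recall_at_depth(n_filler_sentences: int, depth: float, model: str = "gpt-4o") -> bool:
    """Insert the needle at a relative depth inside n filler sentences and
    check whether the model can quote the fact back."""
    sentences = [FILLER] * n_filler_sentences
    sentences.insert(int(depth * n_filler_sentences), NEEDLE)
    prompt = "".join(sentences) + "\n\n" + QUESTION
    resp = client.chat.completions.create(
        model=model,  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return "4172" in (resp.choices[0].message.content or "")


# Sweep haystack size (each filler sentence is roughly a dozen tokens, so
# 50,000 of them is on the order of 600k tokens) and needle depth. A model
# that stops paying attention after ~30k tokens will start failing here even
# though verbatim recall of one fact is trivial at small sizes.
for n in (500, 5_000, 50_000):
    for depth in (0.1, 0.5, 0.9):
        print(n, depth, recall_at_depth(n, depth))
```

Run that sweep against your own documents and you learn where recall actually breaks down, instead of conflating it with a task the model fails at any size.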
Well, LLMs are claimed to be good at math too, and yet they can't count. Same point with long contexts. And our actual use case (insurance) does need it to do both.
My hope for this article is to help non-AI experts figure out when they need to design around a flaw versus believe what's marketed.
> Well, LLMs are claimed to be good at math too, and yet they can't count.
You're putting a lot of weight on counting. I don't know anyone who hears "good at math" and wants to use an LLM for counting, of all things. Algebra, Calculus, Statistics, hell, I used Claude 3 for Special Relativity. Those are the things people will care about when you say math, not counting.
Look, just test your use case and report that lol.
Look man, Claude 3, GPT-4, etc. didn't work for my startup out of the box. I thought it would be helpful to tell others what I went through. Why hate on the truth?
Test the LLM on what you want it to do, not on some other task you think it should be able to do as a proxy for what you want it to do. It's not hard to understand, and I'm not the only one telling you this.
Your article would have been very helpful if you'd simply done that, but you didn't, so it isn't.