LLMs and the Harry Potter problem (pyqai.com)
65 points by araghuvanshi on April 23, 2024 | 61 comments



Not only is "Harry Potter problem" a misnomer, the "shortcomings" of LLMs being investigated here don't feel novel. We know counting is a big weakness of LLMs already, so why muddy the waters if you want to drill down on issues related to large-context recall?

Perhaps my biggest gripe: if you're going to lure readers in with an interesting name like the "Harry Potter problem", it better be either technically interesting or entertaining.


Do most readers know that if you give a so-called million token context model that many tokens, it'll actually stop paying attention after the first ~30k tokens? And that if they were to try to use this product for anything serious, they would encounter hallucinations and incompleteness that could have material implications?

Not everything needs to be entertaining to be useful.


The point is that this isn't even really useful because it's not a minimum reproduction of the problem they're actually interested in.

LLMs are bad at counting no matter what size of context is provided. If you're going to formulate a thought experiment to illustrate how an LLM stops paying attention well before the context limit, it should be an example that LLMs are known to be good at in smaller context sizes. Otherwise you might be entertaining but you're also misleading.


Well LLMs are claimed to be good at math too, and yet they can't count. Same point with the long contexts. And our actual use case (insurance) does need it to do both.

My hope from this article is to help non-AI experts figure out when they need to design around a flaw versus believe what's marketed.


>Well LLMs are claimed to be good at math too, and yet they can't count.

You're putting a lot of weight on counting. I don't know anyone who wants to use an LLM for counting, of all things, after hearing "good at math". Algebra, Calculus, Statistics, hell, I used Claude 3 for Special Relativity. Those are the things people will care about when you say math, not counting.

Look, just test your use case and report that lol.


Look man, Claude 3, GPT-4, etc. didn't work for my startup out of the box. I thought it would be helpful to tell others what I went through. Why hate on the truth?


Test the LLM on what you actually want it to do, not on what you think it should be able to do as a proxy for what you want it to do. It's not hard to understand, and I'm not the only one telling you this.

Your article would have been very helpful if you'd simply done that, but you didn't, so it's not.


But LLMs are good at math, they just aren't good at arithmetic.

https://www.lesswrong.com/posts/qy5dF7bQcFjSKaW58/bad-at-ari...


Seems like using RAFT (https://techcommunity.microsoft.com/t5/ai-ai-platform-blog/r...) to generate the glossary + KGs for a given doc should give better results for cross-doc or intra-doc answers than relying on vector DB chunks alone.


Good share, thank you! Yeah I think Contextual AI has also been doing some interesting work in this area. Glossary is definitely interesting and an area we're looking into. Curious to see what work is being done with building knowledge graphs, that's another area where we've seen positive results.


This is a strange example. An LLM would write a script for you to find how many times "Wizard" is mentioned. That is what it always does when it comes to numbers, because it knows counting is its weakness.

Edit: Going from counting words to finding relevant information spread across different pages for a given task isn't a valid analogy.

Agents are mentioned but not multi-agent architectures [0], where you could have an agent responsible for insurance policies, legal definitions, and/or a bot that is responsible for the big picture in question. They would go back and forth, being expert at their field (or task) and come to a conclusion after some iterations of API calls.

Missed opportunity.

[0] https://microsoft.github.io/autogen/docs/Use-Cases/agent_cha...
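
For concreteness, a minimal sketch of that kind of back-and-forth using AutoGen's GroupChat (the agent names, prompts, and config values here are illustrative placeholders, not a tested setup):

    import autogen

    llm_config = {"config_list": [{"model": "gpt-4", "api_key": "..."}]}  # placeholder config

    # One assistant per area of expertise, plus a proxy that kicks off the conversation.
    policy_agent = autogen.AssistantAgent(
        name="policy_expert",
        system_message="Answer questions strictly from the insurance policy text you are given.",
        llm_config=llm_config,
    )
    legal_agent = autogen.AssistantAgent(
        name="legal_expert",
        system_message="Resolve the legal definitions and exclusions the policy refers to.",
        llm_config=llm_config,
    )
    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        code_execution_config=False,
    )

    # Let the agents iterate for a few rounds and converge on an answer.
    groupchat = autogen.GroupChat(agents=[user_proxy, policy_agent, legal_agent],
                                  messages=[], max_round=8)
    manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
    user_proxy.initiate_chat(manager, message="What is my fire damage coverage? Policy text: ...")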


One of the big issues in this space seems to be wishful thinking. Have you tried to use AutoGen to solve any of the hard questions mentioned in the article? I had a good go at a specific domain, with a corpus of only a dozen thousand pages, and I agree with the article: your agents need to build a solid ontology, and if you found a way to get AutoGen to do that, then please please tell us.


Please see my comment below, and the "Why should I care" section of the post. Yes you can count the number of times the word "wizard" is mentioned, but for tasks that aren't quite as cut-and-dry (say, listing out all of the core arguments of a 100-page legal case), you cannot just write a Python script.

The agentic approach falls apart because again, a self-querying mechanism or a multi-agent framework still needs to know where in the document to look for each subset of information. That's why I argue that you need an ontology. And at that point, agents are moot. A small 7b model with a simple prompt suffices, without any of the unreliability of agents. I suggest trying agents on an actually serious document, the problems are pretty evident. That said, I do hope that they get there one day because it will be cool.
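
To make the ontology point concrete, here's a rough sketch of what that routing could look like (the section names and the call_llm helper are hypothetical, not our actual pipeline):

    # Hypothetical ontology: each question topic maps to the sections that must be read together,
    # wherever they happen to live in the 70-120 page document.
    ONTOLOGY = {
        "fire_damage_coverage": ["premium", "deductible_and_limit", "fire_exclusions", "fire_definition"],
    }

    def call_llm(prompt: str) -> str:
        """Placeholder for a call to a small instruction-tuned model (e.g. a 7b model)."""
        raise NotImplementedError

    def answer(question: str, topic: str, sections: dict[str, str]) -> str:
        # Pull only the sections the ontology says are relevant, then ask one simple question.
        relevant = "\n\n".join(sections[name] for name in ONTOLOGY[topic])
        prompt = ("Using only the policy excerpts below, answer the question.\n\n"
                  f"{relevant}\n\nQuestion: {question}")
        return call_llm(prompt)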


LLMs see tokens not words and counting is a problem for them, high context or no.

Maybe the current state of the art LLM can't solve the kind of high value long context problems you have in mind but what I can tell you though is that you won't find that out by asking it to count.


Isn’t the solution to this particular problem to do what a human would do? I just tried this:

    Write me a short Python program that will count how many times a word occurs in a given text.
    
    Sure, I can help you with that! Here's a short Python program that counts how many times a word occurs in a given text:
    
    ```python
    def count_words(text, word):
        words = text.lower().split()
        count = 0
        for w in words:
            if w == word.lower():
                count += 1
        return count
    
    text = "Snapchat is a fun way to share moments with friends!"
    word = "Snapchat"

    result = count_words(text, word)
    print(f"The word '{word}' occurs {result} times in the text.")
    ```
    
    You can replace the `text` variable with your own text and the `word` variable with the word you want to count. Let me know if you have any questions!
Seems to me that if you give an LLM an environment to run said program it would be able to automatically do this with the correct prompt. This doesn’t solve the insurance policy problem at all but the solution to these problems is different in my opinion.
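
As a sketch of what "give it an environment" could look like with OpenAI-style tool calling (the run_python tool, the model name, and the happy-path assumption that the model actually calls the tool are all illustrative):

    import contextlib, io, json
    from openai import OpenAI

    client = OpenAI()
    tools = [{
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute a short Python script and return its stdout.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }]

    messages = [{"role": "user",
                 "content": "How many times does the word 'wizard' appear in the chapter below?\n<chapter text>"}]
    first = client.chat.completions.create(model="gpt-4-turbo", messages=messages, tools=tools)
    call = first.choices[0].message.tool_calls[0]  # assumes the model chose to write a script

    # Run the generated script and capture its output; sandbox this in anything real.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(json.loads(call.function.arguments)["code"])

    # Feed the result back so the model can state the final count.
    messages.append(first.choices[0].message)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": buf.getvalue()})
    final = client.chat.completions.create(model="gpt-4-turbo", messages=messages, tools=tools)
    print(final.choices[0].message.content)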


That's true, but the problem of long context understanding (say, "summarize each of the situations where the word 'wizard' is mentioned") remains. And that gets much closer to the insurance policy thing.


“Write me a Python program that extracts context surrounding a word from a long text, then creates a prompt to summarize the context.” Still different than the insurance policy problem.
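
A rough version of that program might look like this (the window size and prompt wording are arbitrary choices):

    def contexts_around(text: str, word: str, window: int = 300) -> list[str]:
        """Return ~window characters on each side of every occurrence of word."""
        snippets = []
        lower, target = text.lower(), word.lower()
        start = lower.find(target)
        while start != -1:
            snippets.append(text[max(0, start - window):start + len(word) + window])
            start = lower.find(target, start + 1)
        return snippets

    def build_summary_prompt(text: str, word: str) -> str:
        joined = "\n---\n".join(contexts_around(text, word))
        return f"Summarize each of the following passages that mention '{word}':\n{joined}"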


How much context? One sentence? Two? One paragraph? One page? It's very similar to the insurance policy problem - the text surrounding the information you're looking for, which could be one sentence or ten pages away, is just as important as the information itself.


I mean basically this is the well known problem with LLMs: they know how to mince words but don’t understand meaning. Again, I think you didn’t present a good simple example. As presented, the Harry Potter problem is just using the wrong tool for the job and isn’t the same as the insurance policy problem.

But at the end of the day an LLM is right 80% of the time while being 100% confident 100% of the time that it gets the right answer. You can increase that 80% but I don’t see how the current breed of LLMs can learn to self doubt enough to keep trying to understand better.


At this point, why get the ai involved? The script is trivial and you still need enough knowledge to judge whether it will work, and to run it.

Why doesn't the AI itself decide that writing a Python script is a good way to approach the problem?


In the words of Charles Babbage, "I cannot rightly apprehend what confusion of ideas would lead to such a question."

LLMs (by themselves) cannot reliably count. If you expect them to, then you're falling into the common trap of extrapolating a metacognition layer where none exists.


Mention of that limitation is notably absent in the breathless hype about LLMs.


Direct quote from Anthropic's website: "Opus - Our most intelligent model, which can handle complex analysis, longer tasks with multiple steps, and higher-order math and coding tasks."

So you tell me: if a regular developer reads the above, how can they surmise that the model which can do higher-order math can't count?


Yes, higher-order math does not include arithmetic; that should not be confusing.


It blows my mind how absurd the narrative around LLMs has gotten. It's incredible that this subject has been talked in circles so far that trivial realities like this aren't immediately obvious to everyone.

LLMs do not count. When you ask an LLM to count (which is something it does not do), in the end no counting has happened. No shit, Sherlock.


So what is your point?

Claims of context-window size increases do imply that information can be extracted from such windows. So this is not trivial.

The counting example is a simple case, the "how much is this insurance policy going to cover this damage" is also a calculation.


But the LLM never calculates anything! Why in the world would I ever expect a calculated result?

An LLM doesn't even answer a question, either: it continues a prompt. That continuation looks like an answer to you and me; but to the LLM, it contains nothing more than the most likely tokens.

I keep hearing this story that an LLM is capable of objectivity, and that it's just relatively bad at it. The LLM literally never does objectivity. It can't be bad at a thing it never does.

Every time someone calls this sort of interaction a "limitation", they are only obfuscating the narrative with a useless anthropomorphization. The LLM is not a person. It is not a mind. It does not perform objective thought. It is a statistical model that provides the most likely text. No more, no less.


I, too, could not read a chapter of Harry Potter and then tell you how many times a word was used. This isn't what my brain (and by extension LLMs) is good at. However, if you told me ahead of time that was my goal for reading a chapter, I'd approach reading differently. I might have a scratch pad for tallying. Or I might just do a word find on a document. I'd design a framework to solve the problem.

"The Harry Potter Problem" has the feel of a strawman. LLMs are not universal problem solvers. You still have to break down tasks or give it a framework for working things through. If you ask an LLM to produce a code snippet for word counting, it will do great. Maybe that isn't as sexy, but what are you really trying to achieve?


They very clearly explain why this matters in the "Why should I care?" section. Partially quoting them:

> Harry Potter is an innocent example, but this problem is far more costly when it comes to higher value use-cases. For example, we analyze insurance policies. They’re 70-120 pages long, very dense and expect the reader to create logical links between information spread across pages (say, a sentence each on pages 5 and 95). So, answering a question like “what is my fire damage coverage?” means you have to read: Page 2 (the premium), Page 3 (the deductible and limit), Page 78 (the fire damage exclusions), Page 94 (the legal definition of “fire damage”).

It's not at all obvious how you could write code to do that for you. Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for doing this much more high stakes (and harder to benchmark) task, even if there are "better" ways of solving the Harry Potter problem.


> Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for doing this much more high stakes (and harder to benchmark) task

Not really. The "Harry Potter Problem" as formulated is asking an LLM to solve a problem that it is architecturally unsuited for. LLMs do poorly at counting and similar algorithmic tasks no matter the size of the context provided. The correct approach to allowing an AI agent to solve a problem like this one would be (as OP indicates) to have it recognize that this is an algorithmic challenge that it needs to write code to solve, then have it write the code and execute it.

Asking specific questions about your insurance policy is a qualitatively different type of problem that algorithms are bad at, but it's the kind of problem that LLMs are already very good at in smaller context windows. Making progress on that type of problem requires only extending a model's capabilities to use the context, not simultaneously building out a framework for solving algorithmic problems.

So if anything it's the reverse: solving the insurance problem would be a prerequisite to solving the Harry Potter Problem.


Word counting and summarizing key information are wildly different problems though


Not really.

LLMs can't count well. This is in large part a tokenization issue. That doesn't mean they couldn't answer all those kinds of questions. Maybe the current state of the art can't. But you won't find out by asking it to count.


Some counterarguments:

1. If an AI company promises that their LLM has a million token context window, but in practice it only pays attention to the first and last 30k tokens, and then hallucinates, that is a bad practice. And prompt construction does not help here - the issue is with the fundamentals of how LLMs actually work. Proof: https://arxiv.org/abs/2307.03172

2. Regarding writing the code snippet: as I described in my post, the main issue is that the model does not understand the relationships between information in the long document. So yes, it can write a script that counts the number of times the word "wizard" appears, but if I gave it a legal case of similar length, how would it write a script that extracts all of the core arguments that live across tens of pages?


I'd do it like a human would. If a human were reading the legal case they would have a notepad with them where they would note locations and summaries of key arguments, page by page. I'd code the LLM to look for something that looks like a core argument on each page (or other meaningful chunk of text) and then have it give a summary if one occurs. I may need to do some few-shot prompting to give it an understanding of what to look for. If you are looking for reliable structured output you need to formulate your approach to be more algorithmic and use the LLM for its ability to work with chunks of text.
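
A minimal sketch of that page-by-page "notepad" loop (the ask_llm helper and the extraction prompt are stand-ins for whatever model and few-shot examples you'd actually use):

    def ask_llm(prompt: str) -> str:
        """Placeholder for a call to the model of your choice."""
        raise NotImplementedError

    def extract_core_arguments(pages: list[str]) -> list[dict]:
        notes = []  # the running "notepad": locations plus summaries
        for page_num, page in enumerate(pages, start=1):
            # Ask for a structured verdict per page; few-shot examples would go in this prompt.
            reply = ask_llm(
                "If this page contains a core legal argument, summarize it in one sentence; "
                "otherwise reply NONE.\n\n" + page
            )
            if reply.strip().upper() != "NONE":
                notes.append({"page": page_num, "summary": reply.strip()})
        return notes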


Totally agree there. And that's one of my points: you have to design around this flaw by doing things like what you proposed (or build an ontology like we did, which is also helpful). And the first step in this process is figuring out whether your task falls into a category like the ones I described.

The structured output element is really important too - subject for another post though!


But you could write code to do exactly that easily though

So surely it’s not about LLMs being able to do that, but being smart enough to understand “hey this is something I could write a Python script to do” and be able to write the script, feed the Harry Potter chapter into it, run the script and parse the results (in the same way a human would do)?


Yes, and that's a completely different kind of problem than extending a model's ability to use its context window effectively.

If you solve the problem that way you don't even really need the Harry Potter chapter in the context at all, you could put it as an external document that the agent executes code against. This makes it qualitatively a different problem than the insurance policy questions that the article moves on to.


> I, too, could not read a chapter of Harry Potter and then tell you how many times a word was used. This isn't what my brain (and by extension LLMs) is good at. However, if you told me ahead of time that was my goal for reading a chapter, I'd approach reading differently. I might have a scratch pad for tallying. Or I might just do a word find on a document. I'd design a framework to solve the problem.

What is the relevance in what a human would do? This is not a human and does not work like a human.

I would expect any piece of software that allows me to input a text and ask it to count occurrences of words to do so accurately.


Why would you expect that? You have no basis to have that expectation. The product isn't being presented as something that can do that.


You absolutely could count the number of times "wizard" was used if you had the book in front of you. Similarly the LLM does have the chapter available "to look at." Documents pasted in the context window aren't ethereal.

This explanation/excuse doesn't hold water.


Part of the confusion here is that some people (apparently including you) use the word "LLM" to refer to the entire system that's built up around the language model itself, while others (like OP) are specifically referring to the large language model.

The large language model's context window absolutely is ephemeral. By the time inference is begun all you have is a giant vector that represents the context to date. This means that the model itself does not have the text available to look at, it only has the encoded "memory" of that text.

OP is simply saying that the underlying model is unsuitable for solving problems like this directly, so it makes a bad example for how models don't use their context effectively. A production grade AI agent should be able to solve problems like this, but it will likely do that through external scaffolding, not through improvements to the model itself, whereas improvements to the context window will probably need to occur at the model level.


Yeah. Humans can't count more than ~5 things intuitively; we have to run the "counting algorithm" in our heads. We just learn it so early in life that we don't really think of it as an algorithm. Not surprising at all that LLMs have the same limitation, but fortunately computers are extremely good at running algorithms once instructed to do so.


This feels... a bit obvious to the point of being silly?

It is fairly well-established that context windows are a general issue among LLMs, since attention cost in SOTA models still scales worse than linearly with context length. It's also fairly well-established that LLMs aren't necessarily good at things they aren't trained for.

If you are unwilling or unable to throw enough hardware to overcome the context window problem you'll need to reduce the context. If you're unwilling or unable to train the LLM to task you'll have to restructure information such that the task is more tractable.

I'm glad to see that given their constraints they chose a sensible solution for the business, but overall this really seems like a series of known limitations being called out and doesn't feel like it's a good look coming from a company that touts leveraging AI for pulling information from documents and integrating with existing systems...


I don't think that this is obvious at all. Yes, AI people who read papers on arxiv and know what "SOTA" stands for know it, but that is no longer the main user base of LLMs.

This is meant to be for the developer who doesn't fit the above profile and thinks a model that has a million token context window and "can handle complex analysis, longer tasks with multiple steps, and higher-order math and coding tasks" (direct quote from Anthropic's website), actually can do those things.


Valid! I think the disparity is that the article appears to be written for a fairly technical crowd, but the expectations appear to come from how these particular models are marketed. Most who are fine-tuning LLMs or aware of LongRoPE for extending context windows are probably consumers of research/white papers rather than marketing material.

Having read some of your other comments it appears that part of the issue is that you were marketed a 1 million token context window and research has shown that's not quite the case. That said, the article doesn't do a good job of painting that picture - it is alluded to with "all fail at this task despite having big context windows" but I think it's worth being crystal clear here that the marketing says 1m and that is disingenuous in your experience and backed by research findings.


I think some commenters are missing the point of this post by pointing out an LLM is the wrong tool for the Harry Potter task, because that's literally the point of the post, that people are actively trying to use LLMs in this way, right now. A lot of hucksters and former/current sales people, specifically.

The reason long context windows are advertised is because there are an awful lot of people out there trying to make money replacing customer service agents with LLM-powered chatbots. In order to power them naively (which is the only way sales people who have never built software before know how), you need to feed them a context window full of all your industry/product specific knowledge and then hope that the LLM answers the same way.

But, they don't, and they can't, so you have to spend a lot of time trying to figure out how to tie the LLM down so it responds exactly the way you want, which sort of ruins all the hype and mystery for common people around LLMs. It's been sold as a miracle that can replace people, but the truth is, well, not that, which really hampers the sales process. I think we're seeing this with Elon Musk desperately trying to push people into "FSD". People aren't impressed by AIs that aren't as good as people, if not better than people, at doing whatever task they are supposed to do.


I've made AI assistants that are perfectly accurate with products, pricing, etc. yet still maintain a human quality: https://github.com/bennyschmidt/ragdoll-studio/tree/master/e...

You can accomplish this with RAG.
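
A bare-bones version of that retrieval step, assuming some embedding function (embed here is a stand-in for whatever embedding model you use):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Placeholder for an embedding model call."""
        raise NotImplementedError

    def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
        q = embed(query)
        def score(chunk: str) -> float:
            v = embed(chunk)
            return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))  # cosine similarity
        return sorted(chunks, key=score, reverse=True)[:k]

    def build_prompt(query: str, chunks: list[str]) -> str:
        # The retrieved product/pricing facts get stuffed into the persona's prompt,
        # so answers stay grounded in your data rather than the model's guesses.
        context = "\n".join(retrieve(query, chunks))
        return f"Answer using only these product facts:\n{context}\n\nQuestion: {query}"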

Your overall point is taken though, the LLM itself is not enough, fine-tuning is not always feasible, and I think no matter how good an AI persona gets at, say, teaching yoga - for some yoga students it will never replace an in-person instructor.

However for a game NPC, online agent, Discord bot, etc. not to mention research, translation, tutorials, summarizing, etc. there is a lot of present day utility for LLMs.


I find that the level of hype is very correlated with how much the author is involved in making their own LLM-chatbot-based products, which is not surprising.

Interesting idea with ragdoll, but I'd hate to try to compete with Tavern/Pygmalion cards + Lorebooks, seems like there is a critical mass there already for RP chatbots.


I could do a better job with messaging, but Ragdoll isn't RP chatbots, although you can chat with them to test if they know how to speak correctly. You could form friendships with them I suppose if you want. But the main purpose is for creative deliverables: Concept art, music, videos, copy, voiceovers, SFX, etc. It's like you're building an AI cast and crew for your: Story, game, video, or to commission as an artist or musician.

There is a chat mode where the UI looks like a chat (communication is key with these guys), but there are also views like Picture mode where you paste images to create concept art or upload a still to generate a movie from - so it's more like a creative studio with different views (one of which is chat).

The best part of it all is you're like the boss or conductor, directing a staff of 1 or 2 or hundreds of AI personas all with specific knowledge and abilities to create whatever you need.


We're focused on the bad example because it's literally the title of the article and the model's inability to solve that problem has nothing to do with context windows and everything to do with "when all you have is a hammer".

It doesn't matter if the context window is large or small, the Harry Potter Problem as formulated is going to be just as hard because it's not a problem with false advertising in context window sizes, it's a problem inherent to the computing paradigm.

A version of the Harry Potter Problem that was formulated around a model's ability to recall specific scenes of a novel would be much more useful as an illustration of the limitations of the supposedly-large context windows.


Well the same principle of false advertising re: context window sizes also applies to its inability to count, no? AI companies claim that their models can do math, so wouldn't a regular developer assume that they can also count?

And if I can't trust a so-called SOTA model to partially answer - say, recall each mention of the word "wizard" instead of just giving me the wrong answer - then why should I trust it to list out specific scenes? That's even harder to benchmark.


> It's been sold as a miracle that can replace people, but the truth is, well, not that, which really hampers the sales process

How soon until the hype fades and we enjoy our next AI winter?


"If the only tool you have is a hammer, you tend to see every problem as a nail." -Abraham Maslow


Just as notable: my vacuum makes a workable, albeit poor, rake outside.


Then why do the creators of this vacuum advertise the fact that it's really good at raking? And unlike your analogy, to actually figure out that it's bad at raking you have to read a bunch of academic papers?


Where have you read that creators of LLMs say their products are awesome at counting? It's just the opposite.


I'm talking about the fact that they boast about their models having large context windows. And Anthropic says: "Opus - Our most intelligent model, which can handle complex analysis, longer tasks with multiple steps, and higher-order math and coding tasks." So if I were a non-AI expert, would I not infer that because it can do "higher order math tasks" it can also count?


>complex analysis, longer tasks with multiple steps, and higher-order math and coding tasks.

>counting

Pick one.

This is very similar to the "precision" misconception regarding floating point numbers.

The answer isn't wrong, it's just imprecise.

Hallucinations are a misnomer.

You are trying to get exact integer-to-word accuracy from an architecture that is innately probabilistic and that clashes at the atomic level: words get tokenized, so arithmetic is difficult at a micro scale - the carry bit likely won't make it to the (needed) transformer context, since most numbers don't overflow on average when summed.

It can, however, output a small program - with high confidence - that it can self-evaluate for functional proximity, then use that to help arrive at an answer.

This is a proto-Mixture-of-Experts model, achieved by another hypervisor or guard-dog LLM.


Why should I? If a person told you that they can multiply, divide, add and subtract, would you not also assume that they can at least count?

The point here is: the justifications from AI engineers for why counting vs math aren't the same task, while valid, are irrelevant because marketing never brings up the limitation in the first place. So any logical person who doesn't know a lot about AI will arrive at a logical, albeit practically incorrect conclusion.


>If a person told you that they can multiply, divide, add and subtract, would you not also assume that they can at least count?

But that's not what they said, to be fair. They said it can do complex math - not simple math repeated many times within one inference.

The architecture just clashes against the intent too much to arrive at a useful/acceptable answer.

Had you crafted a larger prompt that recursively divides the context into n separate buckets, then sums them (inverted-binary-tree-wise), you'd likely have better luck with the carry bits tallying correctly.
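
Roughly, that bucketed approach could look like this (ask_llm_count is a stand-in for a per-chunk counting prompt; the bucket size is arbitrary):

    def ask_llm_count(chunk: str, word: str) -> int:
        """Placeholder: prompt the model to count `word` in a chunk small enough to handle reliably."""
        raise NotImplementedError

    def count_in_buckets(text: str, word: str, bucket_chars: int = 2000) -> int:
        # Split into small buckets, count each one separately, then sum the partial tallies.
        # (The pairwise, inverted-binary-tree summation could itself be delegated to the model.)
        buckets = [text[i:i + bucket_chars] for i in range(0, len(text), bucket_chars)]
        return sum(ask_llm_count(b, word) for b in buckets)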


Fair, valid point. I do admit that this is far from a perfect analysis. I do hope, though, that it helps people at least classify their problems into categories where they need to design around the flaw rather than just assuming that the thing “just works”. I appreciate the discussion though!



