Some counterarguments:
1. If an AI company promises that their LLM has a million-token context window, but in practice it only pays attention to the first and last 30k tokens and hallucinates the rest, that is bad practice. And prompt construction does not help here - the issue is with the fundamentals of how LLMs actually work. Proof: https://arxiv.org/abs/2307.03172
2. Regarding writing the code snippet: as I described in my post, the main issue is that the model does not understand the relationships between pieces of information in a long document. So yes, it can write a script that counts the number of times the word "wizard" appears, but if I gave it a legal case of similar length, how would it write a script that extracts all of the core arguments spread across tens of pages?
I'd do it like a human would. If a human were reading the legal case, they would keep a notepad where they note the locations and summaries of key arguments, page by page. I'd have the LLM look for anything that resembles a core argument on each page (or other meaningful chunk of text) and produce a summary whenever one appears. I might need some few-shot prompting to give it an understanding of what to look for. If you want reliable structured output, you need to make your approach more algorithmic and use the LLM for its ability to work with chunks of text - roughly the sketch below.
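Something like this (just a sketch - `call_llm`, `chunk_pages`, `build_notepad`, and the few-shot prompt are placeholder names; wire in whatever chat client and chunking you actually use):

```python
import json
from typing import Callable, Dict, List

# Hypothetical few-shot prompt showing the model what a "core argument" looks
# like and forcing a small JSON reply we can parse reliably.
FEW_SHOT = """You extract core legal arguments from one page at a time.
Reply with JSON: {"has_argument": true/false, "summary": "one sentence, or empty"}.

Example page: "...the defendant contends the contract was void for lack of consideration..."
Example reply: {"has_argument": true, "summary": "Defendant argues the contract is void for lack of consideration."}
"""

def chunk_pages(text: str, chars_per_page: int = 3000) -> List[str]:
    """Naive fixed-size chunking; swap in real page boundaries if you have them."""
    return [text[i:i + chars_per_page] for i in range(0, len(text), chars_per_page)]

def build_notepad(document: str, call_llm: Callable[[str], str]) -> List[Dict]:
    """The 'notepad': one pass over the document, one small question per chunk."""
    notepad = []
    for page_no, page in enumerate(chunk_pages(document), start=1):
        reply = call_llm(f"{FEW_SHOT}\nPage {page_no}:\n{page}")
        try:
            note = json.loads(reply)
        except json.JSONDecodeError:
            continue  # skip malformed replies rather than poisoning the notepad
        if note.get("has_argument"):
            notepad.append({"page": page_no, "summary": note.get("summary", "")})
    return notepad
```

The point is that the relationships across tens of pages get assembled by your code from small, checkable per-chunk answers, not by hoping the model holds the whole case in its head at once.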
Totally agree there. And that's one of my points: you have to design around this flaw by doing things like what you proposed (or building an ontology, as we did, which also helps). The first step in that process is figuring out whether your task falls into one of the categories I described.
The structured output element is really important too - subject for another post though!