> The LLM skeptics need to point out what differs with code compared to Chess, DoTA, etc from a RL perspective.
An obviously correct automatable objective function? Programming can be generally described as converting a human-defined specification (often very, very rough and loose) into a bunch of precise text files.
Sure, you can use proxies like compilation success / failure and unit tests for RL. But key gaps remain. I'm unaware of any objective function that can grade "do these tests match the intent behind this user request".
Contrast with the automatically verifiable "is a player in checkmate on this board?"
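To make the gap concrete, here is a minimal sketch of what such a proxy reward could look like, assuming a hypothetical repo that builds with `make` and tests with `pytest`; the commands and weights are illustrative, not taken from any real training setup.

```python
# Minimal sketch of a proxy reward for code RL: compilation and unit tests
# are checkable automatically, but nothing here measures whether the tests
# capture the intent behind the user request. Commands and weights are
# illustrative only.
import subprocess

def proxy_reward(repo_dir: str) -> float:
    """Score a candidate patch using only automatable signals."""
    # Proxy 1: does the project build at all?
    build = subprocess.run(["make", "build"], cwd=repo_dir, capture_output=True)
    if build.returncode != 0:
        return 0.0  # a patch that doesn't compile gets no reward

    # Proxy 2: do the existing unit tests pass?
    tests = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True)

    # Note what is *not* here: any term for "do these tests match the intent
    # behind this user request?" -- that part has no obvious objective function.
    return 1.0 if tests.returncode == 0 else 0.3
```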
I'll grant that only part of the problem is easily captured by automatic verification. It's not easy to design a good reward model for softer things like architectural choices, asking for feedback before starting a project, etc. The LLM will be trained to make the tests pass and to make the code map given inputs to the desired outputs, and it will do that better than any human, but that is going to be slightly misaligned with what we actually want.
So, it doesn't map cleanly onto previously solved problems, even though there's a decent amount of overlap. But I'd like to add a question to this discussion:
- Can we design clever reward models that punish bad architectural choices, executing on unclear intent, etc? I'm sure there's scope beyond the naive "make code that maps input -> output", even if it requires heuristics or the like.
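One way to picture that scope is to bolt heuristic penalties onto the naive term. The sketch below is a thought experiment only; `ruff`, the file-length threshold, and all weights are placeholders picked for illustration, not a claim about how anyone actually trains these models.

```python
# Sketch of a "cleverer" composite reward: the naive input -> output term
# plus heuristic penalties standing in for softer qualities. Tools, weights,
# and thresholds are made up for illustration.
import pathlib
import subprocess

def composite_reward(repo_dir: str) -> float:
    reward = 0.0

    # Naive term: do the tests pass (input -> output behaviour)?
    tests = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True)
    reward += 1.0 if tests.returncode == 0 else 0.0

    # Heuristic penalty: lint findings as a crude proxy for code quality.
    lint = subprocess.run(
        ["ruff", "check", "."], cwd=repo_dir, capture_output=True, text=True
    )
    reward -= 0.01 * len(lint.stdout.splitlines())

    # Heuristic penalty: very long files as a crude stand-in for
    # architectural sprawl.
    for path in pathlib.Path(repo_dir).rglob("*.py"):
        if len(path.read_text().splitlines()) > 500:
            reward -= 0.2

    # Still unmodelled: unclear intent, whether the agent should have asked
    # for feedback before starting, etc.
    return reward
```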
This is in fact not how a chess engine works. It has an evaluation function that assigns a numerical value (score) based on a number of factors (material advantage, king "safety", pawn structure, etc.).
These heuristics are certainly "good enough" that Stockfish is able to beat the strongest humans, but it's rarely possible for a chess engine to determine whether a position leads to a forced mate.
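For a concrete picture of what "extracted features" means here, a toy version of such a hand-written evaluation might look like the following (using python-chess for the board representation; the weights are placeholder values, not anything Stockfish actually uses).

```python
# Toy hand-written evaluation in the classical style: a weighted sum of
# extracted features. Weights are textbook-ish placeholders and real engines
# use far more factors.
import chess

PIECE_VALUES = {
    chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
    chess.ROOK: 500, chess.QUEEN: 900,
}

def evaluate(board: chess.Board) -> int:
    """Positive scores favour White, negative favour Black (centipawns)."""
    score = 0
    # Feature 1: material advantage.
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    # Feature 2: crude "king safety" -- penalize enemy attacks on the squares
    # around each king.
    for color, sign in ((chess.WHITE, -1), (chess.BLACK, +1)):
        king_sq = board.king(color)
        if king_sq is None:
            continue
        for sq in chess.SquareSet(chess.BB_KING_ATTACKS[king_sq]):
            score += sign * 10 * len(board.attackers(not color, sq))
    return score
```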
I guess the question is whether we can write a good enough objective function that would encapsulate all the relevant attributes of "good code".
An automated objective function is indeed core to how AlphaGo, AlphaZero, and other RL + deep learning approaches work, though it is obviously much more complex and integrated into a larger system.
The core of these approaches is "self-play", which is where the "superhuman" qualities arise. The system plays billions of games against itself and uses the data from those games to further refine itself. It seems that an automated "referee" (objective function) is an inescapable requirement for unsupervised self-play.
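Schematically, that loop looks something like the sketch below. Every function name is a placeholder rather than a real framework's API; the structural point is that `game.result()` plays the role of the automated referee.

```python
# Skeleton of an AlphaZero-style self-play loop. All names are placeholders;
# the key point is that the rules of the game act as an automatic referee,
# so the system can generate its own training data indefinitely.

def self_play_training(model, num_iterations, games_per_iteration):
    for _ in range(num_iterations):
        examples = []
        for _ in range(games_per_iteration):
            game = new_game()                     # placeholder: fresh board
            trajectory = []
            while not game.is_over():             # rules of the game = referee
                move, search_policy = mcts_search(game, model)
                trajectory.append((game.state(), search_policy))
                game.play(move)
            outcome = game.result()               # win/draw/loss: objective, automatic
            examples += [(s, p, outcome) for s, p in trajectory]
        model = train(model, examples)            # refine the net on its own games
    return model
```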
I would suggest that Stockfish and other older chess engines are not a good analogy for this discussion. Worth noting, though, that even Stockfish no longer uses a hand-written objective function over extracted features like you describe. It instead uses a highly optimized neural network trained on millions of positions from human games.
Maybe I am misunderstanding what you are saying, but Stockfish, for example, given enough time and threads, seems very good at finding forced checkmates 20 or more moves deep.
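This is easy to reproduce with python-chess driving a local Stockfish binary. In the sketch below, the binary name and the FEN (a toy mate-in-one) are just examples:

```python
# Ask Stockfish whether it sees a forced mate rather than a centipawn score.
# Assumes python-chess is installed and a Stockfish binary is on the PATH;
# the FEN is a toy mate-in-one position, chosen only for illustration.
import chess
import chess.engine

board = chess.Board("6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1")

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
info = engine.analyse(board, chess.engine.Limit(depth=30))
score = info["score"].pov(chess.WHITE)
if score.is_mate():
    print(f"Forced mate in {score.mate()}")
else:
    print(f"No mate found; evaluation: {score.score()} centipawns")
engine.quit()
```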