In the primary CoT research paper, they discuss figuring out how to train models on formal languages instead of just natural ones. I'm guessing this is one piece of how the model learns tree-like reasoning.
Based on some quick searching, it seems like they are using RL to provide positive/negative feedback on which "paths" to choose when performing CoT.
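As a rough sketch of what that feedback loop could look like (everything here is illustrative, none of these names or functions come from OpenAI): sample several reasoning traces, score each with an outcome-based reward, and keep the winners as positive training signal.

```python
import random

# Hypothetical toy setup: sample several chain-of-thought "paths" for a
# question, score each with a reward function, and reinforce the winners.
# sample_paths/reward are stand-ins, not anyone's actual API.

def sample_paths(question, n=4):
    # Stand-in for sampling n reasoning traces from a model at temperature > 0.
    return [f"{question} :: reasoning variant {i} :: answer {random.choice([41, 42])}"
            for i in range(n)]

def reward(path, correct_answer=42):
    # Outcome-based reward: +1 if the trace ends at the right answer, else -1.
    return 1.0 if path.endswith(str(correct_answer)) else -1.0

question = "What is 6 * 7?"
paths = sample_paths(question)
scored = [(reward(p), p) for p in paths]

# Positively rewarded paths become training targets; negatively rewarded
# ones get down-weighted (or serve as negatives in a preference objective).
positives = [p for r, p in scored if r > 0]
print(f"{len(positives)}/{len(paths)} paths reinforced")
```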
To me it looks like they paired two instances of the model to feed off of each other's outputs with some sort of "contribute to reasoning out this problem" prompt. The earlier 4o demos showed several similar setups with audio. A toy version of that loop is sketched below.
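Purely as an illustration of the pairing (generate() is a stub standing in for a real model call; nothing here is OpenAI's actual setup):

```python
# Two model instances alternate turns, each seeing the growing transcript,
# seeded with a "contribute to reasoning out this problem" style prompt.

def generate(role, transcript):
    # Placeholder: a real implementation would call a language model here.
    return f"[{role}] builds on: {transcript[-1][:40]}..."

transcript = ["Problem: prove that the sum of two even numbers is even."]
roles = ["Solver A", "Solver B"]

for turn in range(4):
    speaker = roles[turn % 2]    # alternate between the two instances
    reply = generate(speaker, transcript)
    transcript.append(reply)     # each instance feeds off the other's output

print("\n".join(transcript))
```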
To create the training data? Almost certainly something like that (likely more than two), but I think they then trained on the synthetic data created by this "conversation". There is no reason a model can't learn to do all of that, especially if you insert special tokens (like <think>, <reflect>, etc., which have already been shown to be useful).
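A minimal sketch of what that formatting step might look like, assuming made-up <think>/<reflect>/<answer> tags (the actual token scheme isn't public): flatten the multi-model "conversation" into one training example with the intermediate reasoning wrapped in special tokens.

```python
# Hypothetical post-processing: turn a transcript of turns into a single
# training string so the model learns to emit structured scratch work.

def to_training_example(problem, turns, final_answer):
    body = ""
    for i, turn in enumerate(turns):
        tag = "reflect" if i % 2 else "think"   # alternate tags, for illustration
        body += f"<{tag}>{turn}</{tag}>\n"
    return f"{problem}\n{body}<answer>{final_answer}</answer>"

example = to_training_example(
    problem="Is 91 prime?",
    turns=["91 = 7 * 13, so it has a divisor other than 1 and itself.",
           "Check: 7 * 13 = 91. Confirmed, not prime."],
    final_answer="No",
)
print(example)
```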