Hacker News

It’s basically a scaled Tree of Thoughts
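For readers unfamiliar with Tree of Thoughts: the core idea is to branch into several candidate "thoughts" at each step, score the partial chains, and keep only the best few. A minimal toy sketch (the `propose_thoughts` generator and `score` function are hypothetical stand-ins for actual LLM calls, not anything OpenAI has published):

```python
import itertools

# Toy stand-ins for a real model; in practice these would be LLM calls.
def propose_thoughts(state, k=3):
    """Hypothetical: return k candidate next 'thoughts' extending a partial chain."""
    return [state + [f"step{len(state)}-{i}"] for i in range(k)]

def score(state):
    """Hypothetical value function: here, prefer chains whose last step ends in '0'."""
    return 1.0 if state and state[-1].endswith("0") else 0.0

def tree_of_thoughts(depth=3, beam=2):
    frontier = [[]]  # each element is a partial chain of thoughts
    for _ in range(depth):
        candidates = list(itertools.chain.from_iterable(
            propose_thoughts(s) for s in frontier))
        # keep the highest-scoring partial chains (beam search over thoughts)
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]

best = tree_of_thoughts()
```

"Scaled" here would mean doing this with a far bigger model, deeper trees, and a learned scorer rather than a hand-written one.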



In the primary CoT research paper they discuss figuring out how to train models using formal languages instead of just natural ones. I'm guessing this is one piece of how the model learns tree-like reasoning.

Based on some quick searching, it seems like they are using RL to provide positive/negative feedback on which "paths" to choose when performing CoT.
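The shape of that idea can be sketched as a tiny bandit-style loop: sample a reasoning "path", get a reward from some verifier, and nudge the model's preference toward rewarded paths. Everything here is a toy stand-in (the path names, the `reward` verifier, and the update rule are all assumptions for illustration):

```python
import random

random.seed(0)

# Hypothetical setup: the model can take one of several reasoning "paths";
# RL feedback nudges it toward paths that lead to verified-correct answers.
paths = ["direct", "decompose", "analogy"]
preference = {p: 0.0 for p in paths}  # learned scores (toy value table)

def reward(path):
    """Hypothetical verifier: pretend 'decompose' reliably yields correct answers."""
    return 1.0 if path == "decompose" else 0.0

for _ in range(200):
    path = random.choice(paths)  # explore uniformly
    r = reward(path)             # positive/negative feedback
    preference[path] += 0.1 * (r - preference[path])  # incremental update

best = max(preference, key=preference.get)
```

A real system would do this over token sequences with policy-gradient methods rather than a lookup table, but the feedback signal plays the same role.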


This seems most likely, with some special tokens thrown in to kick off different streams of thought.
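Concretely, "special tokens to kick off different streams of thought" might look like sentinel strings delimiting each phase of a training example. The token names below are placeholders; the actual tokens (if any) are not public:

```python
# Hypothetical special tokens marking separate streams of thought.
THINK, REFLECT, ANSWER = "<|think|>", "<|reflect|>", "<|answer|>"

def format_trace(question, thoughts, reflection, answer):
    """Assemble a training example where special tokens delimit each stream."""
    parts = [question]
    parts += [f"{THINK}{t}" for t in thoughts]
    parts += [f"{REFLECT}{reflection}", f"{ANSWER}{answer}"]
    return "".join(parts)

trace = format_trace(
    "What is 7*8?",
    ["7*8 = 7*10 - 7*2", "70 - 14 = 56"],
    "Check: 56/7 = 8, consistent.",
    "56",
)
```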


To me it looks like they paired two instances of the model to feed off of each other's outputs, with some sort of "contribute to reasoning out this problem" prompt. In the prior 4o demos they showed several similar things with audio.


To create the training data? Almost certainly something like that (likely more than two instances), but I think they then trained on the synthetic data created by this "conversation". There is no reason a model can't learn to do all of that, especially if you insert special tokens (like think, reflect, etc., which have already been shown to be useful).
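The pipeline being speculated about here can be sketched as a loop where instances alternate roles on a problem and the resulting transcript becomes one synthetic fine-tuning example. `model_respond` and the role names are hypothetical stand-ins, not a description of what OpenAI actually did:

```python
# Sketch: two copies of a model alternate turns on a problem, and the
# resulting transcript becomes synthetic fine-tuning data.
def model_respond(role, transcript):
    """Hypothetical stand-in for an LLM call: this role's next contribution."""
    return f"[{role}] thought #{len(transcript)}"

def generate_synthetic_trace(problem, turns=4):
    transcript = [f"[problem] {problem}"]
    roles = ["proposer", "critic"]  # assumed roles; could be more instances
    for i in range(turns):
        role = roles[i % len(roles)]
        transcript.append(model_respond(role, transcript))
    return "\n".join(transcript)  # train on this as one long CoT example

trace = generate_synthetic_trace("Prove the sum of two evens is even.")
```

Once the model is trained on such transcripts, a single instance can imitate the whole back-and-forth on its own, which fits the parent's point.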


No, I'm referring to how the chain-of-thought transcript reads like the output of two instances talking to each other.


Right, I don't think it's doing that. I think it has likely been fine-tuned to transition between roles. But maybe you are right.



