
While it's true that the model assigns non-zero probabilities to all sequences by design, those probabilities can still become vanishingly small: per-step probabilities multiply across the whole sequence, so replacing that 99% per-step success probability with 10% makes the overall chance of a correct result truly astronomically small.
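To make the compounding concrete (the 100-step horizon here is just an illustrative assumption):

    # Per-step success probabilities multiply across the sequence,
    # so a modest per-step drop becomes a catastrophic overall drop.
    n_steps = 100  # assumed sequence length, for illustration only

    p_high = 0.99 ** n_steps  # ~0.37: still quite plausible overall
    p_low = 0.10 ** n_steps   # 1e-100: astronomically unlikely

    print(f"99% per step over {n_steps} steps: {p_high:.3f}")
    print(f"10% per step over {n_steps} steps: {p_low:.1e}")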

For a novel reasoning strategy, I would expect at least a few individual tokens where the base model assigns much smaller probabilities than the reinforcement-learning-trained one does, rather than probabilities that are only a little smaller but spread out over many tokens (which would better fit a "death by a thousand cuts" scenario).
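A rough sketch of the check I'm describing; the log-probs and the 4-nat cutoff are made-up numbers, and in practice you'd get them by scoring the same completion under both models:

    # Hypothetical per-token log-probs for one sampled completion,
    # scored under the base model and the RL-trained model.
    base_logprobs = [-0.5, -9.2, -0.4, -0.3]
    rl_logprobs = [-0.4, -0.6, -0.3, -0.2]

    # A large gap on any single token (here >4 nats, i.e. the base
    # model finds it ~55x less likely) would point to genuinely novel
    # behavior; uniformly small gaps are death by a thousand cuts.
    gaps = [r - b for r, b in zip(rl_logprobs, base_logprobs)]
    print([i for i, g in enumerate(gaps) if g > 4.0])  # -> [1]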
