The proofs look very different from how LLMs typically write, and I wonder how that emerged. Much more concise. Most sentences are not fully grammatically complete. A bit like how a human would write if they don’t care about form and only care about content and being logically persuasive.
The writing style looks fairly similar to the examples shown in Baker et al. (2025), so it seems plausible that this is a general consequence of doing a lot of RL training, rather than something specific to the methodology used for this model. It’s still concerning, but I’m happy that it doesn’t look noticeably less readable than the examples in the Baker et al. paper.
Maybe tokens are expensive, so there’s optimization pressure towards conciseness?
Yes, I suspect this is the root of the issue. There are strong economic incentives to optimize for shorter sequences that produce correct answers. It’s great that this hasn’t harmed legibility of the chain of thought yet, but this pressure will likely lead to jargon that could quickly turn into a human-unreadable CoT. I see this as one of the main dangers for effectively faithful CoT. And most of the reasonable hopes for aligning LLM-based AGI that I can see route through faithful CoT.
There’s still the possibility that a fresh version of the same model will understand and be happy to correctly interpret the CoT if it’s become a unique language of thought. But that’s a lot shakier than a CoT that can be read by any model.
As I understand it, we don’t actually see the chain of thought here but only the final submitted solution. And I don’t think that a pressure to save tokens would apply to that.
Well, maybe there’s some transfer? Maybe habits picked up from the CoT die hard & haven’t been trained away with RLHF yet?
I’d guess it has something to do with whatever they’re using to automatically evaluate the performance in “hard-to-verify domains”. My understanding is that, during training, those entire proofs would have been the final outputs which the reward function (or whatever) would have taken in and mapped to training signals. So their shape is precisely what the training loop optimized – and if so, this shape is downstream of some peculiarities on that end, the training loop preferring/enforcing this output format.
If the AI is iterating on solutions, there is actually pressure to reduce the length of draft/candidate solutions. Then, it might be that OpenAI didn’t implement a clean up pass on the final solution (even though there wouldn’t be any real pressure to save tokens in the final clean up).
I think pressure on the final submitted solution is likely. That will encourage more insightful proofs over long monotonous case studies.
It’s a research prototype, so probably not style-tuned.
This is OpenAI’s CoT style. See it in the original o1 blog post: https://openai.com/index/learning-to-reason-with-llms/
Here is a screenshot of a chain of thought from the blog post you link:
This looks different from the IMO solutions to me and doesn’t have the patterns I mentioned. E.g., the sentences are grammatically complete.
This is false. It is exactly how RLed LLMs write.
Fwiw, I’ve recently used o3 a lot for requesting proofs, and it writes very differently.
Could you give an example of an RLed LLM that writes like these examples?
Though I agree with Rauno’s comment that it does look like the chain of thought examples from the Baker et al. paper.
A friend runs a startup where they do a lot of RL on CoT on a narrow class of tasks. She shares those with me, but I can’t share them, sorry. The style is incredibly similar/immediately pattern-matched though.
I think r1 should also have a similar style in its CoT.
The proofs o3 outputs are going to be different from proofs it writes as it thinks about them. This is the style RLed models think in.
Did you have access to the full o3 reasoning trace, or just the final output? The two are not the same style at all.
Only the output! I thought Mikhail was referring to the output here, as this is what we see for the IMO problems.
But as I see it now, the consensus seems to be something like “The chain of thought of new models does look like the IMO problem solutions, and if you don’t train the model to produce final answers that look nice to humans, then they will look like the chain of thought. Probably the experimental model’s answers were not yet trained to look nice”.
Is this your position? I think that’s pretty plausible.
You get the gist. I don’t think I’ve ever seen this specific style, but raw reasoning traces can end up looking even weirder.
It sounds a lot like what a human trying to tackle a difficult maths problem might mutter as they did so, without those bits trimmed out of the final product for clarity.