This is false. It is exactly how RLed LLMs write.
Fwiw, I’ve recently used o3 a lot for requesting proofs, and it writes very differently.
Could you give an example of an RLed LLM that writes like these examples?
Though I agree with Rauno’s comment that it does look like the chain of thought examples from the Baker et al. paper.
A friend runs a startup that does a lot of RL on CoT for a narrow class of tasks. She shares those traces with me; I can't share them, sorry. The style is incredibly similar, though, and I immediately pattern-matched it.
I think r1 should also have a similar style in its CoT.
The proofs o3 outputs are going to be different from the proofs it writes as it thinks about them. This is the style RLed models think in.
Did you have access to the full o3 reasoning trace, or just the final output? The two are not the same style at all.
Only the output! I thought Mikhail was referring to the output here, since that's what we see for the IMO problems.
But as I see it now, the consensus seems to be something like: "The chain of thought of new models does look like the IMO problem solutions, and if you don't train the model to produce final answers that look nice to humans, then the answers will look like the chain of thought. Probably the experimental model's answers were not yet trained to look nice."
Is this your position? I think that’s pretty plausible.
You get the gist. I don’t think I’ve ever seen this specific style, but raw reasoning traces can end up looking even weirder.