r1’s reasoning feels conversational. Messy, high error rate, often needs to backtrack. Stream-of-consciousness rambling.
Other models’ reasoning feels like writing. Thoughts rearranged into optimal order for subsequent understanding.
In some sense you’d expect that doing SFT or RLHF with a bunch of high-quality writing makes models do the latter and not the former.
Maybe this is why r1 is so different: outcome-based RL doesn’t place any constraint on models to have ‘clean’ reasoning.
What models are you comparing to, though? For o1/o3 you’re just getting a summary, so I’d expect those to be more structured/understandable whether or not the raw reasoning is.
Yeah. Apart from DeepSeek-R1, the only other major model which shows its reasoning process verbatim is “Gemini 2.0 Flash Thinking Experimental”. A comparison between the CoT traces of those two would be interesting.