The CoTs themselves are well-structured into paragraphs. At first I thought this hinted at a tree search over CoT steps with some process reward model, but r1 also structures its CoTs into nice paragraphs.
The o1 evaluation doc gives—or appears to give—the full CoT for a small number of examples; that might be an interesting comparison.
I haven’t used o1 through the GUI so I can’t check, but hasn’t it always been the case that you can click the <thought for n seconds> triangle and get a summary of the CoT (as discussed in the ‘Hiding the Chains of Thought’ section of that same doc)?
Just had a look. One difference between the CoTs in the evaluation doc and the “CoTs” in the screenshots above is that the evaluation doc’s CoTs tend to begin with the model referring to itself (e.g. “We are asked to solve this crossword puzzle.”, “We are told that for all integer values of k”) or to the problem at hand (e.g. “First, what is going on here? We are given:”). Deepseek r1 tends to do that as well. The screenshots above don’t show that behaviour; instead the text begins in the second person, e.g. “you’re right” and “here’s your table”, which sounds very much like a model response rather than a CoT. In fact the last screenshot is almost identical to the response, and on closer inspection all the greyed-out texts would pass as good responses in place of the final responses in black.
I think this is strong evidence that the greyed-out text is NOT CoT but perhaps an alternative response. Thanks for prompting me!