As I understand it, we don’t actually see the chain of thought here but only the final submitted solution. And I don’t think that pressure to save tokens would apply to that.
I’d guess it has something to do with whatever they’re using to automatically evaluate performance in “hard-to-verify domains”. My understanding is that, during training, those entire proofs would have been the final outputs that the reward function (or whatever) took as input and mapped to training signals. So their shape is precisely what the training loop optimized – and if so, this shape is downstream of some peculiarity on that end: the training loop preferring or enforcing this output format.
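The mechanism being speculated about can be sketched as a toy reward function. Everything below – the function names, the length penalty, the grading stub – is invented for illustration and is not a claim about OpenAI’s actual setup:

```python
# Toy illustration of the speculation above: if an automated grader
# scores the *entire final proof* and folds in a length penalty, the
# training loop will push the model toward terse output formats.
# All names and constants are hypothetical.

def grade_proof(proof: str) -> float:
    """Stand-in for an automatic verifier in a hard-to-verify domain.
    Here we just pretend every proof is correct."""
    return 1.0

def reward(proof: str, length_penalty: float = 0.001) -> float:
    correctness = grade_proof(proof)
    # Penalizing token count on the whole submitted proof would make
    # the terse style precisely what the loop optimizes for.
    return correctness - length_penalty * len(proof.split())

verbose = "We now proceed to show, in considerable detail, that " * 5
terse = "By induction on n. Base: trivial. Step: apply the lemma."

# The terse proof gets strictly higher reward, so updates that climb
# this reward would favor its shape.
assert reward(terse) > reward(verbose)
```

If the reward really was computed over the submitted proof itself, the output format would be a direct optimization target rather than a leaked habit.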
If the AI is iterating on solutions, there is actually pressure to reduce the length of draft/candidate solutions. It might then be that OpenAI didn’t implement a clean-up pass on the final solution (even though there wouldn’t be any real pressure to save tokens in that final clean-up).
Well, maybe there’s some transfer? Maybe habits picked up from the CoT die hard and haven’t been trained away by RLHF yet?
I think pressure on the final submitted solution is likely. That would encourage more insightful proofs over long, monotonous case analyses.