Yes, I suspect this is the root of the issue. There are strong economic incentives to optimize for shorter sequences that produce correct answers. It’s great that this hasn’t harmed the legibility of the chain of thought yet, but this pressure will likely encourage jargon that could quickly make the CoT human-unreadable. I see this as one of the main dangers for effectively faithful CoT. And most of the reasonable hopes I can see for aligning LLM-based AGI route through faithful CoT.
There’s still the possibility that a fresh instance of the same model will understand and be happy to correctly interpret the CoT even if it’s become a unique language of thought. But that’s a lot shakier than a CoT that can be read by any model.
As I understand it, we don’t actually see the chain of thought here but only the final submitted solution. And I don’t think that a pressure to save tokens would apply to that.
Well, maybe there’s some transfer? Maybe habits picked up from the CoT die hard & haven’t been trained away with RLHF yet?
I’d guess it has something to do with whatever they’re using to automatically evaluate performance in “hard-to-verify domains”. My understanding is that, during training, those entire proofs would have been the final outputs which the reward function (or whatever) would have taken in and mapped to training signals. So their shape is precisely what the training loop optimized for – and if so, this shape is downstream of some peculiarity on that end, with the training loop preferring or enforcing this output format.
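To make the shape of that pressure concrete, here is a minimal sketch of a length-penalized reward on the final submitted proof. This is purely illustrative and not anything OpenAI has described; the grader `grade_proof` and the penalty weight `lam` are assumptions for the example.

```python
def grade_proof(problem: str, proof: str) -> float:
    """Stand-in for an automatic grader in a hard-to-verify domain.

    Hypothetical: assumed to return a score in [0, 1] for correctness/quality.
    """
    raise NotImplementedError  # placeholder for whatever evaluator is actually used


def reward(problem: str, proof: str, lam: float = 1e-4) -> float:
    """Correctness score minus a per-token length penalty (illustrative only).

    Even a small `lam` means that, among proofs the grader accepts,
    the tersest one gets the highest reward - which is the pressure
    toward compressed, jargon-heavy output discussed above.
    """
    num_tokens = len(proof.split())  # crude token count for illustration
    return grade_proof(problem, proof) - lam * num_tokens
```

If anything like this applies to the final output (rather than only to intermediate reasoning), the observed terse style would be exactly what the loop selects for.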
If the AI is iterating on solutions, there is actually pressure to reduce the length of draft/candidate solutions. It might then be that OpenAI didn’t implement a clean-up pass on the final solution (even though there wouldn’t be any real pressure to save tokens in that final clean-up).
I think pressure on the final submitted solution is likely. That would encourage more insightful proofs over long, monotonous case analyses.