Interesting! The R1 paper had this as a throwaway line as well:
“To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance …”
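For concreteness, the reward they describe is just the fraction of CoT words that belong to the target language. A minimal sketch of that calculation, assuming a crude ASCII-letters heuristic as the per-word language check (the paper doesn't specify its detector, so that part is a stand-in):

```python
import re

# Hypothetical stand-in for a real word-level language detector:
# treat a word as "target language" (English) if it is purely ASCII letters.
TARGET_WORD = re.compile(r"^[A-Za-z']+$")

def language_consistency_reward(cot: str) -> float:
    """Proportion of target-language words in the CoT, per the R1 description."""
    words = [w.strip(".,;:!?()\"") for w in cot.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    matches = sum(1 for w in words if TARGET_WORD.match(w))
    return matches / len(words)
```

In training this score would be mixed into the overall RL reward, nudging the policy toward monolingual CoTs even when mixed-language reasoning scores better on the task.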
Relatedly, I was surprised at how many people were reacting in surprise to the GPT-5 CoTs, when there was already evidence of R1, Grok, o3, QwQ all having pretty illegible CoTs often.
I’d also be very interested in taking a look at the upcoming paper you mentioned if you’re open to it!
Sure! I’m editing it for the NeurIPS camera-ready deadline, so I should have a much better version ready in the next couple weeks.