Great post! I have an upcoming paper on this (which I shared with OP a few days ago) that goes into weird / illegible CoTs in reasoning models in depth.
I think the results in it support a version of the spandrel hypothesis—that the tokens aren’t causally part of the reasoning, but that RL credit assignment is weird enough in practice that it results in them being vestigially useful for reasoning (perhaps by sequentially triggering separate forward passes where reasoning happens). Where I think this differs from your formulation is that the weird tokens are still useful, just in a pretty non-standard way. I would expect worse performance if you removed them entirely, though not much worse.
This is hard to separate from the model being slightly OOD if you remove the tokens it would normally use. I think that's (part of) the point, though: it isn't that we couldn't apply pressure against this if we tried, it's that in practice RLVR seems to result in it pretty consistently unless we do apply some pressure. And applying that pressure can end up being pretty bad if we e.g. accidentally optimize the CoT to look more legible without the model actually deriving more meaning from it.
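To make that concrete, here's roughly the kind of ablation I have in mind. This is a minimal sketch rather than the actual setup from the paper: the model id, the `looks_weird` heuristic, and the prompting scheme are all placeholders.

```python
# Sketch: does removing the "weird" tokens from a CoT hurt final-answer accuracy?
# Generate a trace, filter suspicious pieces out of it, then let the model
# continue from both versions and compare. All names below are placeholders.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "some-reasoning-model"  # placeholder, not a real checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def looks_weird(piece: str) -> bool:
    # Hypothetical heuristic for "spandrel" tokens: non-ASCII gibberish or
    # long punctuation runs. A real filter would need human labeling.
    return bool(re.search(r"[^\x00-\x7F]|([!?.,-])\1{3,}", piece))

def generate(prompt: str, max_new_tokens: int = 1024) -> str:
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)

def answer_with_and_without_weird_tokens(question: str) -> tuple[str, str]:
    cot = generate(question)  # original reasoning trace
    cleaned_cot = " ".join(p for p in cot.split() if not looks_weird(p))
    # Force an answer from each trace. Comparing accuracy over many questions
    # is the interesting part, and it is confounded by the cleaned trace being
    # slightly OOD for the model, as noted above.
    ans_original = generate(f"{question}\n{cot}\nFinal answer:")
    ans_cleaned = generate(f"{question}\n{cleaned_cot}\nFinal answer:")
    return ans_original, ans_cleaned
```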
Incidentally, I think the only other place I've seen this idea described clearly is this post by @Caleb Biddulph, which I strongly recommend.
that the tokens aren’t causally part of the reasoning, but that RL credit assignment is weird enough in practice that it results in them being vestigially useful for reasoning (perhaps by sequentially triggering separate forward passes where reasoning happens). Where I think this differs from your formulation is that the weird tokens are still useful, just in a pretty non-standard way. I would expect worse performance if you removed them entirely, though not much worse.
Strongly agree with this! This matches my intuition as well after having looked at way too many of these when writing the original paper.
I’d also be very interested in taking a look at the upcoming paper you mentioned if you’re open to it!
As a side note / additional datapoint, I found it funny how this was just kind of mentioned in passing as a thing that happens in https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/:
While gibberish typically leads to lower rewards and naturally decreases at the beginning of RL, it can increase later when some successful gibberish trajectories get reinforced, especially for agentic SWE RL
Interesting! The R1 paper had this as a throwaway line as well:
To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model's performance, this reward aligns with human preferences, making it more readable.
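For concreteness, a toy version of that reward as I read the description; the word-level `is_english` check is a crude stand-in, not DeepSeek's actual implementation:

```python
# Toy version of the language-consistency reward described above: the fraction
# of whitespace-separated words in the CoT that are in the target language.
# `is_english` is a crude stand-in for a real language-ID model.
def is_english(word: str) -> bool:
    return word.isascii() and any(c.isalpha() for c in word)

def language_consistency_reward(cot: str) -> float:
    words = cot.split()
    if not words:
        return 0.0
    return sum(is_english(w) for w in words) / len(words)

# A mixed-language trace gets a fractional reward, which can then be added
# with some weight to the task reward during RL.
print(language_consistency_reward("So the answer is 42 因为 it matches"))  # 0.75 ("42" also fails the crude check)
```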
Relatedly, I was surprised at how many people found the GPT-5 CoTs surprising, when there was already evidence of R1, Grok, o3, and QwQ all often having pretty illegible CoTs.
I’d also be very interested in taking a look at the upcoming paper you mentioned if you’re open to it!
Sure! I’m editing it for the NeurIPS camera-ready deadline, so I should have a much better version ready in the next couple weeks.