I think there are two distinct claims:
(1) “repeated identical sequences likely are nonfunctional spandrels”
(2) “all uses of such tokens are in nonfunctional spandrels”
I think (1) is almost certainly true and (2) is certainly false.
—-
I’m pretty strongly against trying to explain away each instance of non-legible English CoT as its own unique RL configuration issue. This feels like adding epicycles, and it only works post hoc. It can’t predict that, for example, Codex 5-3 now has non-Latin syllables interspersed in words all the time. You can come up with some new explanation, but this seems unprincipled and entirely speculative, since you have to guess at a lab’s training setup every time a new issue in reasoning legibility is observed.
Even in pre-reasoning models you’d expect this dynamic to emerge: https://arxiv.org/abs/2407.13692. Notably, even mild compression (e.g., “linter complaining” instead of “The linter is complaining”), which you would expect from a length penalty, seems missing in Anthropic models.

The linked thread has examples completely unrelated to OpenAI, e.g.: https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/
While gibberish typically leads to lower rewards and naturally decreases at the beginning of RL, it can increase later when some successful gibberish trajectories get reinforced, especially in agentic SWE RL.
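To make that dynamic concrete, here is a minimal, purely illustrative REINFORCE-style sketch (toy Python, invented token names, no baseline; it is not meant to reflect any lab’s actual setup). The point it illustrates: with non-negative rewards and crude credit assignment, any token that happens to co-occur with successful trajectories gets upweighted, whether or not it did anything useful.

```python
import math
import random

random.seed(0)

# Toy policy: each trajectory is one "content" token (which alone determines
# the reward) plus one "filler" token (which has no effect on reward at all).
# "marinade" stands in for a gibberish spandrel.
content_logits = {"right": 0.0, "wrong": 0.0}
filler_logits = {"plain": 0.0, "marinade": 0.0}
LR = 0.3


def sample(logits):
    keys = list(logits)
    weights = [math.exp(logits[k]) for k in keys]
    return random.choices(keys, weights=weights)[0]


def probs(logits):
    z = sum(math.exp(v) for v in logits.values())
    return {k: math.exp(v) / z for k, v in logits.items()}


for step in range(5000):
    content = sample(content_logits)
    filler = sample(filler_logits)
    reward = 1.0 if content == "right" else 0.0  # the filler is never scored

    # REINFORCE with advantage = raw reward (no baseline): every sampled token
    # in a rewarded trajectory gets upweighted, functional or not.
    content_logits[content] += LR * reward
    filler_logits[filler] += LR * reward

print("content:", probs(content_logits))
print("filler: ", probs(filler_logits))
```

In this toy, whichever filler happens to co-occur with early successes tends to get locked in, even though it never affects the reward: the spandrel dynamic in miniature.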
—-
Still… between the fact that this stuff apparently isn’t necessary for frontier performance, and the fact that there’s a thin line between “useful but opaque” and “degenerate looping” (with frequent switches back and forth between these two modes)… this all feels to me more like “the contingent outcome of a janky / under-constrained training process” than “a glimpse[3] at the essential properties of highly capable CoT.”
I genuinely don’t understand what I’m missing here. How are you accounting for the free variable of being able to optimize against the CoT for legibility? I’m not arguing that what o3 is doing is somehow “you may not like it, but this is the optimal language”, but rather that “yeah, of course you can get legible CoT and performance if you’re willing to optimize against it”. (It’s a multidimensional question of exactly how you apply that optimization pressure, but I have no idea where Anthropic sits on that landscape, and you do need to know that to make updates on monitorability-related tradeoffs.)
Earlier I wrote a long reply to this and deleted it because I felt weird about it—it wasn’t a bad comment per se, but it devolved halfway through into a kind of protracted, grandiose grandstanding about Anthropic vs. OpenAI and why I think Anthropic’s models are way better… which is a fun mode for me to write in, but which felt like adding more heat than light when I looked back on it.
But, to summarize the same points more concisely and plainly:
- I’m working off of a mental model that says OpenAI models are reliably “weird” in particular ways, both in CoT and out of it, and that this is less true of other labs’ models (and not true at all of Claude).
- The “OpenAI weirdness” in question has to do with over-learned and indiscriminately deployed verbal reflexes. In the deleted comment I explained more precisely what I meant by that, which took more space than I want to use here; I described some concrete examples of this outside of CoT, such as:
  - o3’s unusual typography and writing style (in responses, not in CoT)
  - a strange behavior I have noticed in gpt-5.2-chat-latest but not in any of its predecessors (roughly, “announcing” its own intent to navigate a tense conversation in a particular way, e.g. “I want to say this gently (but firmly)”, typically as markdown headings; it does this very frequently, using a narrow set of words and phrase templates that recur across independent realizations of the behavior)
- So, the observation that a new Codex model has some new weird CoT behavior—and, meanwhile, that there are some very new Claude models and they don’t seem less legible than their predecessors—is entirely in line with what I would have expected a few months ago. (This is a post-diction, obviously, take it as you will.)
- I would be surprised if the legibility gap between Anthropic and OpenAI ever closes, or if Claude ever becomes as illegible as o3 (or worse). Or, I guess, if OpenAI models ever become as legible as the Claudes of today, although I’m less sure of that one.
- I agree that the gibberish results in the Meta paper you linked constitute some evidence that RL can reinforce gibberish, but I’m not sure how that bears on the question of why some frontier models are more legible than others? “They were supervised for legibility” is one hypothesis, yes, but it might just be that they started off with a “good” initialization that sent RL on a legible trajectory?
- Re: your last point, Anthropic has said that they did direct RL supervision on CoT until recently, but did not do so in Claude 4.5 or later, although those models still went through pre-RL SFT training on traces from the earlier models. If true, this seems like evidence for my “good init” hypothesis from the previous bullet point?[1] But I’m curious what you make of this.
  - That is: if we suppose that frontier-scale RL always produces illegible CoT unless directly penalized for doing so, we would expect the latest Claudes to be less legible than their predecessors. But in fact they are not.
  - Whereas if we suppose, instead, that legibility at the end of RL depends on whether the init for RL can already do legible CoT of some form, then it makes sense that SFT on earlier traces could have “inoculated” recent Claudes against RL-induced legibility degradation.