This makes sense! I agree with points 1-3. Point 4 sounds plausible, but why not just filter the CoTs for legibility before training on them? This seems like an obvious thing to do; see here for one specific proposal by Daniel Kokotajlo. It’s possible that OpenAI indeed did this, but since almost all transcripts produced by o3 contain some weirdness, it wasn’t possible to get a perfectly clean dataset and some of o3′s reasoning patterns were transmitted to 5.n models—we don’t know the quantitative differences in CoT weirdness between o3 and 5.n models.
For point 5, I strongly disagree that we’re seeing 1 in 1000 levels of cherry-picking for weirdness. The anti-scheming paper shows that the weird words are more common in GPQA transcripts than in scheming evals, with “overshadow” appearing up to 60k times more frequently than in webtext:
The metagaming blog post shows that in alignment evals, the term “Redwood” appears up to 0.35 times in 1000 reasoning tokens:
Given this prevalence, 1 in 1000 levels of weirdness definitely sounds like a mischaracterization to me! It’s also hard for me to see why OpenAI would be incentivized to do 1 in 1000 style cherry-picking in a post published on their own blog.
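For concreteness, here is a minimal sketch of how the two kinds of numbers above (occurrences per 1000 reasoning tokens, and overrepresentation relative to a baseline corpus) can be computed. The function names, the whitespace-level token lists, the add-one smoothing, and the toy data are all assumptions made for illustration; this is not the pipeline used in the anti-scheming paper or the metagaming post.

```python
def rate_per_1000(tokens, term):
    """Occurrences of `term` per 1000 tokens (case-insensitive exact match)."""
    if not tokens:
        return 0.0
    term = term.lower()
    hits = sum(1 for t in tokens if t.lower() == term)
    return 1000.0 * hits / len(tokens)


def overrepresentation(cot_tokens, baseline_tokens, term, smoothing=1.0):
    """Ratio of the term's relative frequency in CoT text vs. a baseline corpus.

    Additive smoothing (an assumption of this sketch) keeps the ratio finite
    when the term never appears in the baseline.
    """
    term = term.lower()
    cot_hits = sum(1 for t in cot_tokens if t.lower() == term)
    base_hits = sum(1 for t in baseline_tokens if t.lower() == term)
    cot_freq = (cot_hits + smoothing) / (len(cot_tokens) + smoothing)
    base_freq = (base_hits + smoothing) / (len(baseline_tokens) + smoothing)
    return cot_freq / base_freq


# Toy data, made up purely for illustration.
toy_cot = ["overshadow"] * 10 + ["x"] * 990
toy_baseline = ["x"] * 100_000
print(rate_per_1000(toy_cot, "overshadow"))
print(overrepresentation(toy_cot, toy_baseline, "overshadow"))
```

The point of separating the two measures is that a term can be rare in absolute terms (well under one occurrence per 1000 tokens) and still be enormously overrepresented relative to ordinary web text.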
You’re totally correct, I was wrong there; the weird words are very common.
Edit: Although, hrm, I guess I am still a bit reluctant to try to incorporate it enormously into my model? Like I’m still getting cherry-picking about which CoTs I see, even if “overshadow” is very common. If I saw a bunch of “overshadow” scattered into a GSM8K problem, for instance, it might be more clearly a boring filler than if I saw it in these alignment-specific scenarios. So yeah, I agree the weirdness is real; the examples we’re seeing of it are still cherry-picked.
why not just filter the CoTs for legibility before training on them?
Yeah I don’t know, I wish I had some view of what was going on. My guess is some level of light filtering + risk aversion to heavy filtering.
Long post trying to address comments from a few different threads:
I’ve seen you and nostalgebraist mention a few times that the fact that Anthropic has HHH legible English CoTs is one of the primary points convincing you of your views against semantic drift being the default.
My guess is that CoT spillover/leakage has been a problem in all the Anthropic models, and I don’t think the training-on-CoT before Sonnet 4.5 (and Opus 4.5) is a more important factor than this. Separately, I’d guess there is a bunch of transfer from earlier models if you init on their reasoning traces. So my guess is we’ve just never had Ant models that aren’t effectively significantly trained on the CoT?
I’ve seen arguments from 1a3orn and nostalgebraist about why this was unlikely based on their priors (primarily claims about it being a skill issue), and I think it’s worth updating against similar arguments about what the “natural” prior for chain of thought reasoning is. I’m in general skeptical of such arguments since they seem fairly post hoc, but at this point I think there’s just no evidence for “models by default end up using HHH’d legible English like Anthropic models”. Meanwhile, from other model families we have explicit evidence that this is not the case.
I’d also note that, just based on CoT examples from o3 before safety training, it seems like safety training (where optimization pressure is currently applied directly to the CoT via SFT) was largely responsible for the increase in repeated spandrels like “disclaim disclaim illusions”. Without that, the CoT looks much less degenerate, but definitely not like HHH’d English: lesswrong.com/posts/4hXWSw8t… (new terminology still emerges and gets new meanings even before safety training: alignment.openai.com/metagaming/#re…)
Overall my suspicion is that by default you get semantic drift at minimum (and compression, given everyone uses various forms of length penalties, but the semantic drift is the harder case as far as legibility).
because the information I see is extremely, extremely cherry-picked
I mean we gave a bunch of stats on it and showed comparisons to 2017 era web text as far as overrepresented words. I would strongly disagree with “cherry-picked for weirdness” (and in fact in the paper at least we almost always have randomly selected examples paired with anything cherry-picked to avoid that criticism even for examples).
If I saw a bunch of “overshadow” scattered into a GSM8K problem for instance, it might be more clearly a boring filler
Again, we specifically gave GPQA rates as an example here, and the rate of this terminology was higher for GPQA than for alignment evals. (I’m assuming GSM8K was just a stand-in for any capabilities eval.)
As for the newer GPT-5-esque models, as we mention in the system card, the CoT is much more HHH, but in general I think that should make you suspect that optimization pressure was applied to the CoT.
I think the main reason I’ve been so adamant about pushing back here is that it seems like you’re increasingly discrediting the empirical evidence we do have to try to salvage your views, and the downstream consequence is that instead of sudden HHH legible English CoTs making you suspicious that “wait, did the lab apply additional optimization pressure”, in your view it’s explainable as “ah, they finally fixed their skill issue”. I think the latter is just wrong, is largely misleading researchers to the extent they adopt it, and serves to shield labs from criticism. To the extent that I can, I’m happy to again pull together more empirical evidence, but I genuinely feel like you’re often working backwards from your prior on this subject.
Can you summarize the pieces of evidence that lead you to think that by default CoT becomes illegible with currently-typical amounts of outcome RL? I’ve read through a lot of your comments, clicked on a bunch of the links in them, and still don’t really understand what you are talking about. Fair warning, I have similar priors as 1a3orn and nostalgebraist.
Please don’t link me to OAI model CoTs, I don’t think I can easily adjudicate those. You can just make claims about OAI model CoTs and within the scope of this conversation I’ll accept it. (I won’t accept claims about the cause without evidence.) But even if OAI model CoTs were fully illegible I wouldn’t conclude that by default CoT becomes illegible; I’d want to see that this was true for other models too.
You seem to think that your view is backed by empirical evidence; so far I haven’t seen it. (Jozdien’s illegibility evaluation was the main piece of evidence I knew of apart from OAI CoTs, but it doesn’t seem to hold up.)
FWIW, my read on the relative legibility of different families’ CoTs is not “Anthropic is uniquely legible” but rather “o3 (and to some extent, OpenAI in general) is uniquely illegible.”
OpenAI’s CoTs are weirder (or more precisely, have a much higher “peak weirdness”) than anyone else’s. Weirdness seems unrelated to capability level when comparing across labs [...]
Several OpenAI models do the marinade/parted/etc. stuff; I have not seen anything comparably unintelligible from any non-OpenAI model
Even if we totally throw out any evidence from Anthropic, there is just nothing out there remotely as weird as the o3 stuff. Every other model whose CoTs I’ve read is much more (seemingly) legible and much closer to ordinary English. This includes various models which are more capable than o3 and/or which likely received much more RL compute than o3, and includes models from many different families (Gemini, Grok, Qwen, DeepSeek[1], GLM, Kimi, StepFun, MiMo, Longcat, arguably later OpenAI models...).
Now, is my view that “models by default end up using HHH’d legible english”? No. My view is that:
there is not much empirical evidence on what can cause CoT to be more or less illegible
I doubt that the legibility of CoT is very strongly constrained by capability pressure at current capability levels, i.e. I expect it is under-determined and can probably go various ways depending on messy details of the training setup and the initialization
it is conceivable that o3 was trained with much less optimization pressure on CoT than any of the (many) other reasoning models available today, and that this explains the gap in legibility between o3 and all those other models—but it would take more evidence than I’ve yet seen to convince me of this, and specifically more evidence about which models actually received optimization pressure on CoT in what ways
like, if o3-esque illegibility is the default, then I would be surprised that no one else has gotten there just by doing some fairly simple recipe and not caring too much about legibility? (that’s basically what R1-Zero was, and IIRC even it produces much more legible CoT than o3)
I know that Jozdien’s paper had results on illegible CoTs from Qwen and DeepSeek models, but… I have used these models a lot, or the DeepSeek ones anyway, and rounding things off to “these and o3 are similarly illegible, as opposed to Claude which is different” would be a dramatic misstatement of what the CoTs are actually like in practice. If you disagree, I encourage you to pick some random GPQA question—or indeed, really any long-CoT-requiring question whatsoever—and give it to R1, QwQ and o3 and see what happens.
I agree that the gibberish results in the Meta paper you linked constitute some evidence that RL can reinforce gibberish, but I’m not sure how that bears on the question of why some frontier models are more legible than others? “They were supervised for legibility” is one hypothesis, yes, but it might just be that they started off with a “good” initialization that sent RL on a legible trajectory?
You mentioned here that Anthropic CoTs kept providing evidence for your “maybe this is explained in all cases by skill issue” theory, which I think people should now update against. I think theories which predicted that this was explained by “good init” should be downweighted relative to ones which predicted optimization pressure on the CoT. You could argue that we haven’t done a specific carefully controlled study in this exact Anthropic case, but at that point we’re really bending over backwards to favor the good init hypothesis.
I also think there are preexisting empirical reasons not to favor the good init skill issue hypothesis, like https://arxiv.org/abs/2407.13692 (Prover Verifier Games), which specifically deals with a monitorability tax with respect to legibility.
Without the good init skill issue hypothesis, this also seems like:
I agree that the gibberish results in the Meta paper you linked constitute some evidence that RL can reinforce gibberish
Should again be evidence that this is to some extent “the default” in an unrelated model family.
I’m somewhat skeptical of that paper’s interpretation of the observations it reports, at least for R1 and R1-Zero.
I’d note that here you just seem to be disagreeing with R1’s original paper. I don’t think you’ve ever shown that they were somehow wrong in their R1-Zero observations, and this seems important for R1-Zero, given it’s exactly the evidence you’d expect to see if you were wrong here.
R1 CoTs are usually totally legible
Yes, but this is after they had to do SFT on human reasoning specifically to address illegible or ambiguously degenerate reasoning seen in R1-Zero. I’ve seen you and 1a3orn repeatedly cite R1 (in your personal experience) being legible; then, when Arun’s paper provided results with R1 and other models producing illegible CoT, this was blamed on model providers; then that also proved insufficient based on the response from Arun here: https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=5kCjAKY7srmYmQFom which you seem to acknowledge.
This is importantly on top of the fact that none of it addresses why originally in R1-Zero this was a problem.
Working my way through other models you listed:
Gemini, Grok,
Do you know what optimization pressure was applied to their CoT during midtraining or posttraining? If not, I think it’s incorrect that you’ve been updating towards the “good init skill issue” hypothesis based on observations from them, and they should be in “no evidence either way”.
I’m not all that convinced by my own evidence here… but also, like I said, it’s tough to summarize precisely what that evidence is. It’s not just the fact that the models seem confused by these words when they appear as “input” rather than “CoT,”
I still think this is just super invalid for the OpenAI models, these are separate channels and even if they weren’t it’s unclear a priori what to make of this, but your objections here do still seem to weight them heavily in practice.
Note that it wrote out the same exact 14-word phrase 14 times before finally escaping the loop. If the textual content here is serving any computation role at all, this surely must be, at best, a very inefficient way to accomplish whatever it is that it’s accomplishing (?).
or in the comment in this thread:
that’s basically what R1-Zero was, and IIRC even it produces much more legible CoT than o3
A significant update I think both of you should make is that these degenerate repetitions in OpenAI models seem almost entirely due to safety training applied after capabilities training. From capabilities training alone, you do not see these extreme repeated spandrels, but you do see significant semantic drift over time. This is a new piece of information since the original paper (which we also didn’t know at the time of the paper).
I have not seen either of you update on this, as evidenced by both of you in this thread seeming to still argue against the specific spandrel-dense illegibility in the final production o3. My expectation here is that you’re still overindexing on this, and instead of updating that capabilities training via outcome-based RL results in significant semantic drift, you’re focusing on your own intuitions based on the later spandrel-heavy CoT, which I think is incorrect.
In addition to the quantitative analysis in that post, there are concrete full-transcript CoT examples of “o3 before degenerate spandrel spam” if you want to get a qualitative sense here.
there’s a thin line between “useful but opaque” and “degenerate looping” (with frequent switches back and forth between these two modes)… this all feels to me more like “the contingent outcome of a janky / under-constrained training process” than “a glimpse at the essential properties of highly capable CoT.”
Again to separate out actual claims here:
the terminology is absolutely useful but opaque for some terms, and usage (i.e. the meaning) changes over capabilities training in a non-degenerate, non-spandrel-heavy way. This is just empirically true, and I think it’d be useful to look at the figure here: https://alignment.openai.com/metagaming#ref-11. The counterclaim would be that models don’t have semantic drift under outcome-based RL, which seems incredibly strong and empirically false
the argument has never been that these are unmitigable (i.e. essential) properties of highly capable CoT. In fact, Arun’s paper mentions that paraphrasing illegible sections recovers performance in some cases. The disagreement is about whether you get this by default absent specific countermeasures
The main thing that seems to me to have no empirical evidence is the claim that under outcome-based RL, without specific interventions to address it, you will not get significant semantic drift. I see no empirical evidence for this, and it doesn’t seem to fit with pre-reasoning-model expectations. (Far less strongly: individual humans’ personal notes, and human language in general when geographically isolated, show serious semantic drift.)
Notably “significant semantic drift” is specifically not an argument for the defaultness of (1) repeated degenerate spandrels (2) a “new alien language”, but it does matter, because it harms monitorability for misalignment and human legibility of the CoT.
tldr:
there is not much empirical evidence on what can cause CoT to be more or less illegible
My overall view currently is:
final o3 is particularly degenerate: agree
but you do in fact absolutely get loss of human monitorability for misalignment via semantic drift by default, and there is plenty of empirical evidence that we are increasingly seeing this
there are probably cheap mitigations you could do without major perf hit if labs were disciplined about this (but they probably aren’t and won’t be)
unclear what the faithfulness / obfuscation hit of those mitigations would be (or that labs would make sure they could measure faithfulness / obfuscation impact before applying mitigations)
Note: The main reasons I’ve been doing long writeups on this is I do think you have good and deep intuitions about models, and to the extent I’m wrong about things here I’ve been trying to at least make it legible externally why I’m thinking about things this way.
I have not seen either of you update on this, as evidenced by both of you in this thread seeming to still argue against the specific spandrel-dense illegibility in the final production o3. My expectation here is that you’re still overindexing on this, and instead of updating that capabilities training via outcome-based RL results in significant semantic drift, you’re focusing on your own intuitions based on the later spandrel-heavy CoT, which I think is incorrect.
Ah, to clarify, when I talk about “o3-level illegibility” I am including the CoT samples shared in your metagaming paper in that bucket. They lack the degenerate looping behavior but still seem substantially less legible in other respects than CoTs from any other model I’ve worked with.
The observation in your paper is interesting but I don’t see how it implies updates to the views I’ve already stated; I said that the degenerate looping behavior looked non-functional, and indeed we agree that it probably is (and was not induced by RL), and meanwhile I still see a big legibility gap between pre-production o3 and other models I’ve observed, including R1-Zero[1].
That issue aside, I’m not sure how to resolve the remaining disagreement, or even what it actually is (i.e. what the cruxes are).
Largely because I don’t understand what predictions your view makes—or, equivalently, what hypothetical observations it rules out. I don’t have a clear sense of how your approach to the empirical evidence differs from the following (obviously problematic) rule:
If we observe that some model, M, produces illegible CoT --> that’s evidence for your view because you think this happens without countermeasures
If we instead observe that the same model, M, produces legible CoT --> well, they must have been using countermeasures (or we don’t know if they were, which is typically true), so this is at worst neutral for your view
(And thus, no possible set of observations about the CoTs of particular models could constitute counter-evidence to your view; or, equivalently, your view makes no predictions about the CoTs of particular models. I’m not saying that this is the case, necessarily, just asking for clarification about why it’s not the case, if it’s not)
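One way to make the worry above precise, as a sketch: write $H$ for a hypothesis like "illegible CoT is the default absent countermeasures" and $E$ for the observation "model $M$ produces illegible CoT" (these symbols are introduced here purely for illustration). The law of total probability then constrains how the two possible observations can move one's credence:

```latex
P(H) = P(H \mid E)\,P(E) + P(H \mid \neg E)\,P(\neg E),
\qquad 0 < P(E) < 1
% Since P(H) is a weighted average of the two posteriors:
P(H \mid E) > P(H) \;\Longrightarrow\; P(H \mid \neg E) < P(H)
```

In words: if observing illegible CoT from $M$ counts as evidence for $H$, then observing legible CoT from the same $M$ has to count at least somewhat against it. A view for which both outcomes are favorable or neutral is not actually updating on $E$ at all.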
For example, you have interpreted R1 in both of these two ways: first you said (citing Arun’s paper) that it produced illegible CoT and that this was evidence for your view over mine, and then when I said that R1 CoTs were legible, you said that didn’t matter because countermeasures were applied. So it seems like R1′s legibility doesn’t actually matter one way or another?
You have also been explaining many different legible-CoT observations across many model families by saying: well, they might have applied optimization pressure to their CoT during midtraining or posttraining, we don’t have any way to know. Which is true.
But as you’ve noted, OpenAI did apply such pressure to the production o3 and it actually made it somewhat less legible (to the surprise of both of us).
And meanwhile, all the other labs seem to be converging on substantially more legible CoT than pre-production o3. If none of that matters for your view—because it can all be explained away by invoking the possibility of unobserved and unobservable countermeasures (all of them more effective, apparently, than that one o3 countermeasure which we do know a bit more about[2]) -- then, like… what are we even doing, here? What’s the point of talking about all these specific observations if you would feel equally comfortable sticking with your prior position no matter what way those observations turn out?
(To clarify, I don’t think “no, it actually doesn’t matter to me how these observations turn out” is necessarily an invalid position to take. It is possible that we are just reading patterns into tea leaves here, that all of this evidence is too messy and partial and confounded to be worth considering—if that’s the case then we should notice it’s the case and stop looking at the tea leaves entirely. But that would mean actually doing that, i.e. agreeing that we simply have very little direct evidence about the matter, and dropping the dispute about who ought to update in which ways based on particular details about Anthropic’s mistakes, or about whether R1 is legible, or whatever.)
Incidentally, if you trust Arun’s paper, his measured illegibility rate for R1-Zero is not actually much higher than for R1! In his Table 2, the point estimates for R1-Zero and R1 are 4.59 and 4.30 respectively, as compared to e.g. 3.78 for QwQ and ~3.3 for the R1 distills. I am still pretty skeptical of the R1 result in this paper, but unless you share my skepticism, this ought to affect your view of how much the R1 countermeasures mattered.
And indeed, it’s not at all obvious to me that “accidentally sending CoT to the RL grader in some % of episodes with no instructions about what to do with it” (what Anthropic did to several recent models) would be a more effective legibility intervention than “doing SFT on CoT at the end” (what OpenAI did to production o3). Since you were surprised by the direction of the effect in o3′s case, I don’t think you should be so sure about the direction in Anthropic’s case either.
You have also been explaining many different legible-CoT observations across many model families by saying: well, they might have applied optimization pressure to their CoT during midtraining or posttraining, we don’t have any way to know. Which is true.
But as you’ve noted, OpenAI did apply such pressure to the production o3 and it actually made it somewhat less legible (to the surprise of both of us).
You’re mixing several notions of legibility and optimization pressure. OpenAI specifically applied SFT as part of Deliberative Alignment during safety training in a way that seems to have increased repeated spandrels.
I said that the degenerate looping behavior looked non-functional
The degenerate repeats themselves, yes, but I think you’re empirically wrong that the illegible text itself is non-functional. I don’t know what to do with disagreement on this point, as right now it’s empirical results vs your personal private experiments with models.
For example, you have interpreted R1 in both of these two ways: first you said (citing Arun’s paper) that it produced illegible CoT and that this was evidence for your view over mine, and then when I said that R1 CoTs were legible, you said that didn’t matter because countermeasures were applied. So it seems like R1′s legibility doesn’t actually matter one way or another?
Again, you’re being technically imprecise. R1-Zero was explicitly noted to have gibberish, repetitions, mixed language, etc., to the point where they introduced an additional SFT warm start and other format and language penalties to get R1. R1 still has some remaining illegibility, as shown in Arun’s paper. (Note that this also isn’t an argument about some particular magnitude of effect.)
And indeed, it’s not at all obvious to me that “accidentally sending CoT to the RL grader in some % of episodes with no instructions about what to do with it” (what Anthropic did to several recent models) would be a more effective legibility intervention than “doing SFT on CoT at the end” (what OpenAI did to production o3). Since you were surprised by the direction of the effect in o3′s case, I don’t think you should be so sure about the direction in Anthropic’s case either.
Again, you’re confused on this and mixing up several sources of optimization pressure and their effects. Here you’re being quite subtle about it, mixing “me being surprised that their SFT induced increased degradation in the form of repeated loops” and “RL applied directly to the CoT with HHH pressure causing obfuscation” (which is well established).
I’m willing to just give up on this exchange, since each time I’ve tried to make a pretty detailed series of specific points, and I haven’t found you or 1a3orn willing to engage with them, instead continually conflating specific points into some caricatured view.
Want to echo Nost that, to the best of my knowledge, I’ve never said “models by default end up using HHH’d legible English like Anthropic models.” I’ve never said anything about it being harmless; nor anything about it not metagaming; etc.
I endorse his first two bullet points below, particularly the second, as a close-enough summary of what I believe.
I want to add to the second that continued CoT legibility probably also depends on how you do midtraining, because there’s likely a (midtraining → RL) cycle that gets repeated a few times; I don’t think there’s ever going to be a pure natural language + GRPO / PPO LLM again. And of course you can influence what kind of reasoning shows up there; see, like, Doria’s work.
As an addition—I wouldn’t be surprised if models learn to embellish words with slightly more formal meanings—i.e., “Watchers” for the grader—but like, it’s important to note that this level of “new language” falls short of what can be done by bored 9 year olds in five minutes. New jargon, even new terminology, doesn’t make a new language!
As far as I can tell the only two causally independent pieces of evidence for weird, actually new-language like CoTs look like:
Jozdien’s paper
o3 / GPT-5.n CoTs, which are all tightly causally entangled.
Jozdien I haven’t reproduced. Like, the above is like 1% of some things I did partly to reproduce his unintelligible results. And after looking at like, 100s of CoTs from models like Ling, Minimax, DS, and so on, I just haven’t gotten unintelligible / random seeming results in a way that “felt natural” to me. So… maybe I’m doing it wrong? But he thinks they look like spandrels anyway, as far as I know, so this means that regardless it’s the kind of thing you can just drop with a better training mechanism.
And that just leaves o3 as the new-language thing, from which I’ve already updated—I think you’re frustrated because I’m not doing separate updates for o3, then for GPT-5, then for something related to GPT-5, or something? But that’s double-counting evidence.
If you have some evidence that doesn’t fall into this schema I’d be very happy to know what it is, and of course to update from it. Sorry the links you provide are a bit tangled, and I might have missed something.
I think the latter is just wrong and is largely misleading researchers to the extent they adopt it, and serves to shield labs from criticism
Neither here nor there, but I’m really just interested in what’s the case here, and not in downstream effects. IMO this attitude (“will my results be politically useful to labs?”) has hurt a lot of technical work.
And that just leaves o3 as the new-language thing, from which I’ve already updated
Again, literally no one is arguing for an ex nihilo new language. The argument is repeatedly about whether you get semantic drift “by default”.
I wouldn’t be surprised if models learn to embellish words with slightly more formal meanings
I’m not clear if you haven’t read the post in this case, but “embellish words with slightly more formal meanings” is just absolutely, empirically not what is happening. The linked post goes into this for multiple terms. (The content is in the linked section of that post)
And after looking at like, 100s of CoTs from models like Ling, Minimax, DS, and so on, I just haven’t gotten unintelligible / random seeming results in a way that “felt natural” to me. So… maybe I’m doing it wrong?
To be clear, a lot of nost’s arguments had a similar “I looked at a bunch of stuff” flavor. I’m pretty surprised that a lot of the points in the paper (for example the unusual terminology being extremely statistically common including on capabilities benchmarks) seem new to you, as you consistently have expressed strong opinions about this so I assumed you had arguments of why this fit your model.
But he thinks they look like spandrels anyway, as far as I know, so this means that regardless it’s the kind of thing you can just drop with a better training mechanism.
Okay, but “phenomenon happens” and “the vibe I get is that you could fix this with a better training mechanism” are extremely different claims.
Neither here nor there, but I’m really just interested in what’s the case here, and not in downstream effects. IMO this attitude (“will my results be politically useful to labs?”) has hurt a lot of technical work.
Yeah, my point here was that I haven’t seen any updating on empirical evidence. I assume this is because you just have very strong priors of your own on this, but it isn’t clear to me what empirical evidence you’d update on at this point. What would you expect to see if you were wrong here?
I’ve been arguing for a while now that Anthropic CoTs clearly have significant optimization pressure applied (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=cfjCYn4uguopdexkJ)
Ryan disagreed with my claim originally but seems to have since updated: https://www.lesswrong.com/posts/RDG2dbg6cLNyo6MYT/thane-ruthenis-s-shortform?commentId=LDyYDuMCHtj5NcjKo
I’ve seen arguments from 1a3orn and nostalgebraist about why this was unlikely based on their priors (primarily claims about it being a skill issue) and I think it’s worth updating against similar arguments about what the “natural” prior for chain of thought reasoning is. I’m in general skeptical of such arguments since they seem fairly post hoc, but at this point I think there’s just no evidence for “models by default end up using HHH’d legible english like anthropic models”. Meanwhile, from other model families we have explicit evidence that this is not the case.
As for the spandrels specifically, repeating from here since it was both surprising and a meaningful update to me: https://x.com/bronsonschoen/status/2042053159816704212?s=46&t=4aLPwuA9FyTHP5OGEphaxA
Overall my suspicion is that by default you get semantic drift at minimum (and compression, given everyone uses various forms of length penalties, but the semantic drift is the harder case as far as legibility).
I mean we gave a bunch of stats on it and showed comparisons to 2017 era web text as far as overrepresented words. I would strongly disagree with “cherry-picked for weirdness” (and in fact in the paper at least we almost always have randomly selected examples paired with anything cherry-picked to avoid that criticism even for examples).
Again, we specifically gave GPQA rates as an example here, and the terminology was higher for GPQA than alignment evals. (I’m assuming GSM8K was just a stand-in for any capabilities eval)
As for the newer GPT-5-esque models, as we mention in the system card, the CoT is much more HHH, but in general I think that should make you suspect that optimization pressure was applied to the CoT.
I think the main reason I’ve been so adamant about pushing back here is because it seems like you’re increasingly discrediting the empirical evidence we do have to try to salvage your views here, and the downstream consequence is that instead of sudden HHH legible English CoTs making you suspicious that “wait, did the lab apply additional optimization pressure”, in your view it’s explainable as “ah, they finally fixed their skill issue”. I think the latter is just wrong and is largely misleading researchers to the extent they adopt it, and serves to shield labs from criticism. To the extent that I can I’m happy to again pull together more empirical evidence, but I genuinely feel like you’re often working backwards from your prior on this subject.
Can you summarize the pieces of evidence that lead you to think that by default CoT becomes illegible with currently-typical amounts of outcome RL? I’ve read through a lot of your comments, clicked on a bunch of the links in them, and still don’t really understand what you are talking about. Fair warning, I have similar priors as 1a3orn and nostalgebraist.
Please don’t link me to OAI model CoTs, I don’t think I can easily adjudicate those. You can just make claims about OAI model CoTs and within the scope of this conversation I’ll accept it. (I won’t accept claims about the cause without evidence.) But even if OAI model CoTs were fully illegible I wouldn’t conclude that by default CoT becomes illegible; I’d want to see that this was true for other models too.
You seem to think that your view is backed by empirical evidence; so far I haven’t seen it. (Jozdien’s illegibility evaluation was the main piece of evidence I knew of apart from OAI CoTs, but it doesn’t seem to hold up.)
FWIW, my read on the relative legibility of different families’ CoTs is not “Anthropic is uniquely legible” but rather “o3 (and to some extent, OpenAI in general) is uniquely illegible.”
As I put it here:
Even if we totally throw out any evidence from Anthropic, there is just nothing out there remotely as weird as the o3 stuff. Every other model whose CoTs I’ve read is much more (seemingly) legible and much closer to ordinary English. This includes various models which are more capable than o3 and/or which likely received much more RL compute than o3, and includes models from many different families (Gemini, Grok, Qwen, DeepSeek[1], GLM, Kimi, StepFun, MiMo, Longcat, arguably later OpenAI models...).
Now, is my view that “models by default end up using HHH’d legible english”? No. My view is that:
there is not much empirical evidence on what can cause CoT to be more or less illegible
I doubt that the legibility of CoT is very strongly constrained by capability pressure at current capability levels, i.e. I expect it is under-determined and can probably go various ways depending on messy details of the training setup and the initialization
it is conceivable that o3 was trained with much less optimization pressure on CoT than any of the (many) other reasoning models available today, and that this explains the gap in legibility between o3 and all those other models—but it would take more evidence than I’ve yet seen to convince me of this, and specifically more evidence about which models actually received optimization pressure on CoT in what ways
like, if o3-esque illegibility is the default, then I would be surprised that no one else has gotten there just by doing some fairly simple recipe and not caring too much about legibility? (that’s basically what R1-Zero was, and IIRC even it produces much more legible CoT than o3)
I know that Jozdien’s paper had results on illegible CoTs from Qwen and DeepSeek models, but… I have used these models a lot, or the DeepSeek ones anyway, and rounding things off to “these and o3 are similarly illegible, as opposed to Claude which is different” would be a dramatic misstatement of what the CoTs are actually like in practice. If you disagree, I encourage you to pick some random GPQA question—or indeed, really any long-CoT-requiring question whatsoever—and give it to R1, QwQ and o3 and see what happens.
You mentioned here that Anthropic CoTs kept providing evidence for your “maybe this is explained in all cases by skill issue theory”. Which I think people should now update against. I think theories which predicted that this was explained by “good init” should be downweighted relative to ones which predicted optimization pressure on the CoT. You could argue that we haven’t done a specific carefully controlled study in this exact Anthropic case, but at that point we’re really bending over backwards to favor the good init hypothesis.
I also think there are preexisting empirical reasons like: https://arxiv.org/abs/2407.13692 (Prover Verifier Games) specifically to induce a monitorability tax with respect to legibility, to not favor the good init skill issue hypothesis.
Without the good init skill issue hypothesis, this also seems like:
Should again be evidence that this is to some extent “the default” in an unrelated model family.
I’d note that here you just seem to be disagreeing with R1’s original paper. I don’t think you’ve ever shown that they were somehow wrong in their R1-Zero observations, and this seems important given R1-Zero is exactly the evidence you’d expect to see if you were wrong here.
Yes, but this is after they had to do SFT on human reasoning specifically to address illegible or ambiguously degenerate reasoning seen in R1-Zero. I’ve seen you and 1a3orn repeatedly cite R1 (in your personal experience) being legible; then, when Arun’s paper provided results with R1 and other models producing illegible CoT, this was blamed on model providers; then that also proved insufficient based on the response from Arun here: https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=5kCjAKY7srmYmQFom which you seem to acknowledge.
This is importantly on top of the fact that none of it addresses why originally in R1-Zero this was a problem.
Working my way through other models you listed:
Do you know what optimization pressure was applied to their CoT during midtraining or posttraining? If not, I think it’s incorrect that you’ve been updating towards the “good init skill issue” hypothesis based on observations from them, and they should be in “no evidence either way”.
I still think this is just super invalid for the OpenAI models, these are separate channels and even if they weren’t it’s unclear a priori what to make of this, but your objections here do still seem to weight them heavily in practice.
or in the comment in this thread:
A significant update I think both of you should make is that these degenerate repetitions in OpenAI models seem almost entirely due to safety training applied after capabilities training. From capabilities training alone, you do not see these extreme repeated spandrels, but you do see significant semantic drift over time. This is a new piece of information post original paper (which we also didn’t know at the time of the paper).
I have not seen either of you update on this, as evidenced by both of you in this thread seeming to still argue against the specific spandrel-dense illegibility in the final production o3. My expectation here is that you’re still overindexing on this, and instead of updating that capabilities training via outcome-based RL results in significant semantic drift, you’re focusing on your own intuitions based on the later spandrel-heavy CoT, which I think is incorrect.
In addition to the quantitative analysis in that post, there are concrete full transcript CoT examples of “o3 before degenerate spandrel spam” if you want to get a qualitative sense here.
Again to separate out actual claims here:
the terminology is absolutely useful but opaque for some terms, and usage (i.e. the meaning) changes over capabilities training in a non-degenerate, non-spandrel-heavy way. This is just empirically true, and I think it’d be useful to look at the figure here: https://alignment.openai.com/metagaming#ref-11 (the counterclaim would be that models don’t have semantic drift under outcome-based RL, which seems incredibly strong and empirically false)
the argument has never been that these are unmitigable (i.e. essential) properties of highly capable CoT; in fact Arun’s paper mentions that paraphrasing illegible sections in some cases recovers performance. The disagreement is about whether you get this by default absent specific countermeasures
The main thing that seems to me to have no empirical evidence is the claim that under outcome-based RL, without specific interventions to address it, you will not get significant semantic drift. I see no empirical evidence for this, it doesn’t seem to fit with pre-reasoning-model expectations, and (far less strongly) individual humans in their own personal notes, and human language in general when geographically isolated, show serious semantic drift.
Notably “significant semantic drift” is specifically not an argument for the defaultness of (1) repeated degenerate spandrels (2) a “new alien language”, but it does matter, because it harms monitorability for misalignment and human legibility of the CoT.
tldr:
My overall view currently is:
agree that final o3 is particularly degenerate
but you do in fact absolutely get loss of human monitorability for misalignment via semantic drift by default, and there is plenty of empirical evidence that we are increasingly seeing this
there are probably cheap mitigations you could do without major perf hit if labs were disciplined about this (but they probably aren’t and won’t be)
unclear what the faithfulness / obfuscation hit of those mitigations would be (or that labs would make sure they could measure faithfulness / obfuscation impact before applying mitigations)
Note: The main reason I’ve been doing long writeups on this is that I do think you have good and deep intuitions about models, and to the extent I’m wrong about things here I’ve been trying to at least make it legible externally why I’m thinking about things this way.
Ah, to clarify, when I talk about “o3-level illegibility” I am including the CoT samples shared in your metagaming paper in that bucket. They lack the degenerate looping behavior but still seem substantially less legible in other respects than CoTs from any other model I’ve worked with.
The observation in your paper is interesting but I don’t see how it implies updates to the views I’ve already stated; I said that the degenerate looping behavior looked non-functional, and indeed we agree that it probably is (and was not induced by RL), and meanwhile I still see a big legibility gap between pre-production o3 and other models I’ve observed, including R1-Zero[1].
That issue aside, I’m not sure how to resolve the remaining disagreement, or even what it actually is (i.e. what the cruxes are).
Largely because I don’t understand what predictions your view makes—or, equivalently, what hypothetical observations it rules out. I don’t have a clear sense of how your approach to the empirical evidence differs from the following (obviously problematic) rule:
If we observe that some model, M, produces illegible CoT --> that’s evidence for your view because you think this happens without countermeasures
If we instead observe that the same model, M, produces legible CoT --> well, they must have been using countermeasures (or we don’t know if they were, which is typically true), so this is at worst neutral for your view
(And thus, no possible set of observations about the CoTs of particular models could constitute counter-evidence to your view; or, equivalently, your view makes no predictions about the CoTs of particular models. I’m not saying that this is the case, necessarily, just asking for clarification about why it’s not the case, if it’s not)
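The worry in the rule above can be stated with a standard Bayesian update. This is only a sketch: the symbols are illustrative shorthand, with H standing for "illegibility is the default without countermeasures" and E for an observation about model M's CoT.

```latex
% Posterior odds = Bayes factor x prior odds
\frac{P(H \mid E)}{P(\neg H \mid E)}
  = \frac{P(E \mid H)}{P(E \mid \neg H)} \cdot \frac{P(H)}{P(\neg H)}

% If "M is illegible" supports H, its Bayes factor exceeds 1:
P(E_{\text{illegible}} \mid H) > P(E_{\text{illegible}} \mid \neg H)

% Since E_legible is the complement of E_illegible, this forces
P(E_{\text{legible}} \mid H) = 1 - P(E_{\text{illegible}} \mid H)
  < 1 - P(E_{\text{illegible}} \mid \neg H) = P(E_{\text{legible}} \mid \neg H)
```

So, under these assumptions, if an illegible CoT from M counts as evidence for H, a legible CoT from the same M has a Bayes factor below 1 and must count at least somewhat against H. Treating it as exactly neutral is only consistent if the illegible observation is treated as neutral too.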
For example, you have interpreted R1 in both of these two ways: first you said (citing Arun’s paper) that it produced illegible CoT and that this was evidence for your view over mine, and then when I said that R1 CoTs were legible, you said that didn’t matter because countermeasures were applied. So it seems like R1’s legibility doesn’t actually matter one way or another?
You have also been explaining many different legible-CoT observations across many model families by saying: well, they might have applied optimization pressure to their CoT during midtraining or posttraining, we don’t have any way to know. Which is true.
But as you’ve noted, OpenAI did apply such pressure to the production o3 and it actually made it somewhat less legible (to the surprise of both of us).
And meanwhile, all the other labs seem to be converging on substantially more legible CoT than pre-production o3. If none of that matters for your view—because it can all be explained away by invoking the possibility of unobserved and unobservable countermeasures (all of them more effective, apparently, than that one o3 countermeasure which we do know a bit more about[2]) -- then, like… what are we even doing, here? What’s the point of talking about all these specific observations if you would feel equally comfortable sticking with your prior position no matter what way those observations turn out?
(To clarify, I don’t think “no, it actually doesn’t matter to me how these observations turn out” is necessarily an invalid position to take. It is possible that we are just reading patterns into tea leaves here, that all of this evidence is too messy and partial and confounded to be worth considering—if that’s the case then we should notice it’s the case and stop looking at the tea leaves entirely. But that would mean actually doing that, i.e. agreeing that we simply have very little direct evidence about the matter, and dropping the dispute about who ought to update in which ways based on particular details about Anthropic’s mistakes, or about whether R1 is legible, or whatever.)
Incidentally, if you trust Arun’s paper, his measured illegibility rate for R1-Zero is not actually much higher than for R1! In his Table 2, the point estimates for R1-Zero and R1 are 4.59 and 4.30 respectively, as compared to e.g. 3.78 for QwQ and ~3.3 for the R1 distills. I am still pretty skeptical of the R1 result in this paper, but unless you share my skepticism, this ought to affect your view of how much the R1 countermeasures mattered.
And indeed, it’s not at all obvious to me that “accidentally sending CoT to the RL grader in some % of episodes with no instructions about what to do with it” (what Anthropic did to several recent models) would be a more effective legibility intervention than “doing SFT on CoT at the end” (what OpenAI did to production o3). Since you were surprised by the direction of the effect in o3’s case, I don’t think you should be so sure about the direction in Anthropic’s case either.
You’re mixing several notions of legibility and optimization pressure. OpenAI specifically applied SFT as part of Deliberative Alignment during safety training in a way that seems to have increased repeated spandrels.
The degenerate repeats themselves yes, but I think you’re empirically wrong that the illegible text itself is non-functional. I don’t know what to do with disagreement on this point, as right now it’s empirical results vs your personal private experiments with models.
Again you’re being technically imprecise. R1-Zero explicitly was noted to have gibberish, repetitions, mixed language, etc. to the point where they introduced an additional SFT warm start and other format and language penalties to get R1. R1 still has some remaining illegibility as shown in Arun’s paper. (Note that this also isn’t an argument about some particular magnitude of effect)
Again, you’re confused on this and mixing up several sources of optimization pressure and their effects. Here you’re being quite subtle about it, mixing “me being surprised that their SFT induced increased degradation in the form of repeated loops” and “RL applied directly to the CoT with HHH pressure causing obfuscation” (which is well established).
I’m willing to just give up on this exchange, since each time I’ve tried to make a pretty detailed series of specific points and I haven’t found you or 1a3orn willing to engage with them, instead continually conflating specific points into some caricatured view.
I agree, it no longer seems worth the time/effort.
If you’re interested, I just put up a new post in which I re-ran Jozdien’s experiment using a different R1 provider and got both much less illegibility and much higher task accuracy.
Want to echo Nost that, to the best of my knowledge, I’ve never said “models by default end up using HHH’d legible english like anthropic models.” I’ve never said anything about it being harmless; nor anything about it not metagaming; etc.
I endorse his first two bullet points below, particularly the second, as a close-enough summary of what I believe.
I want to add to the second that continued CoT legibility probably also depends on how you do midtraining, because there’s likely a (midtraining → RL) cycle that gets repeated a few times; I don’t think there’s ever going to be a pure natural language + GRPO / PPO LLM ever again. And of course you can influence what kind of reasoning shows up there; see, like, Doria’s work.
As an addition—I wouldn’t be surprised if models learn to embellish words with slightly more formal meanings—i.e., “Watchers” for the grader—but like, it’s important to note that this level of “new language” falls short of what can be done by bored 9 year olds in five minutes. New jargon, even new terminology, doesn’t make a new language!
As far as I can tell the only two causally independent pieces of evidence for weird, actually new-language like CoTs look like:
Jozdien’s paper
o3 / GPT-5.n CoTs, which are all tightly causally entangled.
Jozdien I haven’t reproduced. Like, the above is like 1% of some things I did partly to reproduce his unintelligible results. And after looking at like, 100s of CoTs from models like Ling, Minimax, DS, and so on, I just haven’t gotten unintelligible / random seeming results in a way that “felt natural” to me. So… maybe I’m doing it wrong? But he thinks they look like spandrels anyway, as far as I know, so this means that regardless it’s the kind of thing you can just drop with a better training mechanism.
And that just leaves o3 as the new-language thing, from which I’ve already updated—I think you’re frustrated because I’m not doing separate updates for o3, then for GPT-5, then for something related to GPT-5, or something? But that’s double-counting evidence.
If you have some evidence that doesn’t fall into this schema I’d be very happy to know what it is, and of course to update from it. Sorry the links you provide are a bit tangled, and I might have missed something.
Neither here nor there, but I’m really just interested in what’s the case here, and not in downstream effects. IMO this attitude (“will my results be politically useful to labs?”) has hurt a lot of technical work.
Again, literally no one is arguing for an ex nihilo new language. The argument is repeatedly about whether you get semantic drift “by default”.
I’m not clear if you haven’t read the post in this case, but “embellish words with slightly more formal meanings” is just absolutely, empirically not what is happening. The linked post goes into this for multiple terms. (The content is in the linked section of that post)
To be clear, a lot of nost’s arguments had a similar “I looked at a bunch of stuff” flavor. I’m pretty surprised that a lot of the points in the paper (for example the unusual terminology being extremely statistically common including on capabilities benchmarks) seem new to you, as you consistently have expressed strong opinions about this so I assumed you had arguments of why this fit your model.
Okay but “phenomenon happens” and “the vibe I get is that you could fix this with a better training mechanism” are extremely different claims.
Yeah my point here was I haven’t seen any updating on empirical evidence. I assume this is because you just have very strong priors of your own on this, but it isn’t clear to me what empirical evidence you’d update on at this point. What would you expect to see if you were wrong here?