I write original fiction.
Also I have opinions about AI and stuff, sometimes.
Elsewhere:
Same person as nostalgebraist2point0, but now I have my account back.
I have signed no contracts or agreements whose existence I cannot mention.
I agree this is evidence, but I’m not sure it’s strong evidence. For example, illegible reasoning may help the model reach the right answer, while being worse than other, more legible, reasoning the model could use. If a model could make better use of some text for reasoning than we have capacity to monitor it, that’s still pretty concerning.
Ah, I agree with this and it is what I was trying to say in the (ambiguously phrased) line you quoted from the post.
By “not helping the model reach the right answer,” I meant “not helping relative to what it could achieve if constrained to be legible (say, by rejection sampling using your judge),” rather than “not helping relative to not performing that portion of the reasoning at all (say, by early-exiting the CoT when it starts to become illegible).” While I do suspect that the latter claim is also approximately true for the Targon gibberish, that’s just a hunch and I did not attempt to test that hypothesis in this post.
I’m willing to just give up on this exchange
I agree, it no longer seems worth the time/effort.
If you’re interested, I just put up a new post in which I re-ran Jozdien’s experiment using a different R1 provider and got both much less illegibility and much higher task accuracy.
I have not seen either of you update on this, as evidenced by both in this thread seeming to still argue against the specific spandrel-dense illegibility in the final production o3. My expectation here is that you’re still overindexing on this, and instead of updating that capabilities training via outcome-based RL results in significant semantic drift, you’re focusing on your own intuitions based on the later spandrel-heavy CoT, which I think is incorrect.
Ah, to clarify, when I talk about “o3-level illegibility” I am including the CoT samples shared in your metagaming paper in that bucket. They lack the degenerate looping behavior but still seem substantially less legible in other respects than CoTs from any other model I’ve worked with.
The observation in your paper is interesting but I don’t see how it implies updates to the views I’ve already stated; I said that the degenerate looping behavior looked non-functional, and indeed we agree that it probably is (and was not induced by RL), and meanwhile I still see a big legibility gap between pre-production o3 and other models I’ve observed, including R1-Zero[1].
That issue aside, I’m not sure how to resolve the remaining disagreement, or even what it actually is (i.e. what the cruxes are).
Largely because I don’t understand what predictions your view makes—or, equivalently, what hypothetical observations it rules out. I don’t have a clear sense of how your approach to the empirical evidence differs from the following (obviously problematic) rule:
If we observe that some model, M, produces illegible CoT --> that’s evidence for your view because you think this happens without countermeasures
If we instead observe that the same model, M, produces legible CoT --> well, they must have been using countermeasures (or we don’t know if they were, which is typically true), so this is at worst neutral for your view
(And thus, no possible set of observations about the CoTs of particular models could constitute counter-evidence to your view; or, equivalently, your view makes no predictions about the CoTs of particular models. I’m not saying that this is the case, necessarily, just asking for clarification about why it’s not the case, if it’s not)
For example, you have interpreted R1 in both of these two ways: first you said (citing Arun’s paper) that it produced illegible CoT and that this was evidence for your view over mine, and then when I said that R1 CoTs were legible, you said that didn’t matter because countermeasures were applied. So it seems like R1’s legibility doesn’t actually matter one way or another?
You have also been explaining many different legible-CoT observations across many model families by saying: well, they might have applied optimization pressure to their CoT during midtraining or posttraining, we don’t have any way to know. Which is true.
But as you’ve noted, OpenAI did apply such pressure to the production o3 and it actually made it somewhat less legible (to the surprise of both of us).
And meanwhile, all the other labs seem to be converging on substantially more legible CoT than pre-production o3. If none of that matters for your view—because it can all be explained away by invoking the possibility of unobserved and unobservable countermeasures (all of them more effective, apparently, than that one o3 countermeasure which we do know a bit more about[2]) -- then, like… what are we even doing, here? What’s the point of talking about all these specific observations if you would feel equally comfortable sticking with your prior position no matter what way those observations turn out?
(To clarify, I don’t think “no, it actually doesn’t matter to me how these observations turn out” is necessarily an invalid position to take. It is possible that we are just reading patterns into tea leaves here, that all of this evidence is too messy and partial and confounded to be worth considering—if that’s the case then we should notice it’s the case and stop looking at the tea leaves entirely. But that would mean actually doing that, i.e. agreeing that we simply have very little direct evidence about the matter, and dropping the dispute about who ought to update in which ways based on particular details about Anthropic’s mistakes, or about whether R1 is legible, or whatever.)
Incidentally, if you trust Arun’s paper, his measured illegibility rate for R1-Zero is not actually much higher than for R1! In his Table 2, the point estimates for R1-Zero and R1 are 4.59 and 4.30 respectively, as compared to e.g. 3.78 for QwQ and ~3.3 for the R1 distills. I am still pretty skeptical of the R1 result in this paper, but unless you share my skepticism, this ought to affect your view of how much the R1 countermeasures mattered.
And indeed, it’s not at all obvious to me that “accidentally sending CoT to the RL grader in some % of episodes with no instructions about what to do with it” (what Anthropic did to several recent models) would be a more effective legibility intervention than “doing SFT on CoT at the end” (what OpenAI did to production o3). Since you were surprised by the direction of the effect in o3’s case, I don’t think you should be so sure about the direction in Anthropic’s case either.
FWIW, my read on the relative legibility of different families’ CoTs is not “Anthropic is uniquely legible” but rather “o3 (and to some extent, OpenAI in general) is uniquely illegible.”
As I put it here:
OpenAI’s CoTs are weirder (or more precisely, have a much higher “peak weirdness”) than anyone else’s. Weirdness seems unrelated to capability level when comparing across labs [...]
Several OpenAI models do the marinade/parted/etc. stuff; I have not seen anything comparably unintelligible from any non-OpenAI model
Even if we totally throw out any evidence from Anthropic, there is just nothing out there remotely as weird as the o3 stuff. Every other model whose CoTs I’ve read is much more (seemingly) legible and much closer to ordinary English. This includes various models which are more capable than o3 and/or which likely received much more RL compute than o3, and includes models from many different families (Gemini, Grok, Qwen, DeepSeek[1], GLM, Kimi, StepFun, MiMo, Longcat, arguably later OpenAI models...).
Now, is my view that “models by default end up using HHH’d legible english”? No. My view is that:
there is not much empirical evidence on what can cause CoT to be more or less illegible
I doubt that the legibility of CoT is very strongly constrained by capability pressure at current capability levels, i.e. I expect it is under-determined and can probably go various ways depending on messy details of the training setup and the initialization
it is conceivable that o3 was trained with much less optimization pressure on CoT than any of the (many) other reasoning models available today, and that this explains the gap in legibility between o3 and all those other models—but it would take more evidence than I’ve yet seen to convince me of this, and specifically more evidence about which models actually received optimization pressure on CoT in what ways
like, if o3-esque illegibility is the default, then I would be surprised that no one else has gotten there just by doing some fairly simple recipe and not caring too much about legibility? (that’s basically what R1-Zero was, and IIRC even it produces much more legible CoT than o3)
I know that Jozdien’s paper had results on illegible CoTs from Qwen and DeepSeek models, but… I have used these models a lot, or the DeepSeek ones anyway, and rounding things off to “these and o3 are similarly illegible, as opposed to Claude which is different” would be a dramatic misstatement of what the CoTs are actually like in practice. If you disagree, I encourage you to pick some random GPQA question—or indeed, really any long-CoT-requiring question whatsoever—and give it to R1, QwQ and o3 and see what happens.
I’m revisiting this subject after a friend explicitly told me that they were impressed by ChatGPT written prose, and believed it to be superior to most human prose.
Taste is a subjective matter, but I am baffled by this preference. The rest of this post describes my frustrations with AI-written prose. My hope is that clarifying these complaints will be a small contribution toward improving the state of AI writing. If we do not dramatically improve the quality of AI writing, I worry that our literary culture will only further degrade as AI writing proliferates.
I share these reactions. If anything, I feel like AI-written prose is getting worse over time by my own lights, and I am confused and unsettled by the divergence between my own reactions to it and the reactions of others.
I took this quiz when it first came out and preferred the human passage in 5⁄5 cases. It’s not that I thought the human passages were blindingly brilliant or anything—more that the AI passages were bad, and bad in the specific ways that I’ve encountered in so many other AI attempts at creative writing.
There are a few tricks that it deploys over and over and over; they aren’t even very good tricks to begin with (IMO), but I can understand being impressed by them once or twice or thrice. But they get old quick. Or, they do to me, anyway… maybe I’m the weird one, and most people have boundless appetites for these same tricks repeated indefinitely? (A disturbing thought.) Or maybe I just have more exposure to AI-written prose than a lot of people do, even now, and there is a real gap in “how impressed you are by ‘the tricks’ on your first few exposures” but not in the capacity to get sick of the same tricks eventually?
I’ve written about this before, and tried to pin down and catalogue exactly what the “tricks” are (see here and here), but I feel that my negative gut-level reactions to AI prose are outpacing my attempts to formalize those reactions in clear literary-critical terminology. The flaws I described in the linked posts are not quite the same as the ones I perceive in the AI-written samples in this contest; in both cases the flaws are intuitively obvious to me, and feel intuitively like only-superficially-distinct manifestations of the same underlying problem or problems, but I don’t know how to formalize that felt sense into an argument about what’s wrong with “the AI style” that generalizes well to new cases where I can smell something’s wrong (and wrong in a distinctively AI way) even though all the details are slightly different.
As an example, I had an immediate “ugh, AI slop” reaction to this sentence from the owl passage:
The ground was cold and giving.
This has several of the features I talked about in “hydrogen jukeboxes,” like personification and conjunction of opposites. But those are not really why I had such a strong reaction here. No, it’s… something about the word “giving,” in this context, and about the tone established by that usage? I could try to describe the tone I’m talking about here—it feels very specific, and very distinctively “AI,” and grating even when human writers do it but especially grating now due to its overuse by AI—but if I were to try, I would end up getting frustrated with the gap between my capacity to see what’s plainly there in front of me and the limitations of my powers to describe such things and clarify how they differ from others of the same type that are similar but not identical.
If Sonnet 4.5 and Haiku 4.5 [edit: and Opus 4.5] were the only major Anthropic reasoning models that didn’t undergo CoT optimization during RL, that makes them kind of an accidental in-the-wild experiment. I wonder what could be learned by comparing their CoTs to those of their successors and predecessors.
It is striking, for instance, that these models had higher verbalized eval awareness rates in automated behavioral audits than other Claude models. (Though obviously it’s not a controlled experiment and I’m not sure how you’d test that this was the cause.)
I wonder if their CoTs are less legible?
This is not that far off from my own experience, including the part about wondering whether I’m crazy / whether it’s just me / whether there’s something I’m missing.
(Except for the “not being much of a coder” thing—I do a lot of coding, and have for many years, and the hype is very confusing to me. I’ve been using coding assistants on and off for around a year now, starting with Sonnet 3.7; at the time, they offered me nothing more than an incremental speed-up in certain atypically well-scoped tasks that I myself was relatively bad at, whereas now, with the latest models and harnesses, they… still offer me an incremental speed-up in certain atypically well-scoped tasks that I myself am relatively bad at. I’ve actually stopped using them entirely, recently, because I got fed up with having to maintain the cruft they wrote.)
That said, I do find LLMs very useful in a variety of ways, and have even found it helpful to discuss research with them at times.
I’m not sure I agree with the notion that they’re only good “inside the training distribution,” because it’s not clear what counts as “being inside the training distribution” if we’re conceding that LLMs can synthesize attributes which each appeared somewhere in training but never co-occurred in a single training example. Once the concept of “inside the training distribution” has been widened to include “everything anyone has ever written and anything that can be formed by ‘recombining’ those texts in arbitrarily abstract ways,” what doesn’t count as inside the training distribution? Of course one can always point at a failure post-hoc and say “oh, I must have strayed outside the distribution,” but this explains nothing unless the failures can be predicted in advance, and it’s not clear to me that they can.
The most useful heuristic I have for when today’s LLMs seem brilliant vs. dumb is, instead, that their failures are very often the result of them not having sufficient context about what’s going on and what exactly you need from them[1].
The lack of context is so often the bottleneck that, these days, I basically think of Claude as being strictly smarter than me in every way provided that Claude knows all the relevant information I know… which, of course, Claude never really does. But the more I supply that info to Claude—the closer I get to that asymptotic limit of Claude really, truly knowing exactly what my current research problem is and exactly what dead ends I’ve tried and why I think they failed etc. etc. -- the closer Claude gets to matching, or exceeding, my own level of competence.
And so I’m always thinking to myself, these days, about how to “get context into Claude.” If I’m doing something on a computer instead of asking Claude to do it, it’s because I have decided that getting the requisite context into Claude would be more onerous than just doing the whole thing myself. (Which is usually the case in practice.)
And this observation makes me feel like I’m going crazy when I see all the hype about LLM agents, about long-running autonomy and “one-shotting” and so on.
Because: “getting context into Claude” is not a task that Claude is very good at doing!
The reason for this is intuitive: it’s a cold start problem, a bootstrap paradox, whatever you want to call it. Claude is weak and unreliable until it has enough context—which means that if you let Claude handle the task of giving itself context, it will do so weakly and unreliably. And since its performance at everything else is so strongly gated on this one foundational step, executing that step weakly and unreliably will have catastrophic consequences.
Instead, I tend to keep things on a pretty tight leash—with a workflow that looks more like a carefully pre-designed and inflexible sequence of stages than an “agent” doing whatever it feels like—and I do a lot of verbose and detailed writing work to spell out everything that matters, in detail.
This is true both when I’m chatting directly with Claude (or another LLM), and when I’m writing code that uses LLMs to process data or make decisions. In both cases, I write very long and detailed messages (or message templates), in which I put a lot of effort into things like “clarifying potential confusions before they arise” and “including arguably extraneous information because it helps ‘set the scene’ in which the work is taking place” and “giving the LLM tips on how to think about the problem and how to check its work to verify it isn’t making a mistake I’ve seen it make before.”
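To give a rough sense of the shape of these templates, here is a minimal sketch (the section headings and placeholder fields are purely illustrative, not my actual templates, which are much longer and task-specific):

```python
# A minimal, illustrative sketch of the kind of message template described above.
# The section headings and placeholder fields are invented for this example.
TEMPLATE = """\
## What this project is and why it exists
{project_background}

## What I need from you in this specific message
{task_description}

## Things that might confuse you (read before starting)
{known_pitfalls}

## Possibly-extraneous context that helps set the scene
{extra_context}

## How to check your work before answering
{verification_tips}
"""

def build_prompt(**fields: str) -> str:
    """Fill the template; every field is written out verbosely and in detail."""
    return TEMPLATE.format(**fields)
```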
When this works, it really works; I have seen Claude perform some pretty remarkable feats while inside this kind of “information-rich on-rails experience,” ones that impressed me much more than any of the high-autonomy agentic one-shotting stuff that the hype is focused on[2]. But it is a very different approach from the sort of thing that is getting hyped, and it requires a lot of upfront manual effort that often isn’t worth it.
Arguably this is sort of like the “training distribution” story, except adjusted to take in-context learning into account.
You can teach the LLM new tricks without re-training it—but you need to give it enough information to precisely specify the tricks in question and distinguish them from the many other slightly different tricks which you, or someone else, might hypothetically have wanted it to learn instead. After all, it has been trained so that it could work with many different users who all want different things, and it has no way of “determining which one of the users you are” except by the use of distinguishing factors that you provide to it.
More speculatively, I think even some of the cases when the LLM says something really dumb or vacuous—as opposed to “correctly solving a task, but the wrong one”—might really be further instances of “correctly solving the wrong task,” only at a different level of abstraction.
For all it knows at the outset, you might be the sort of person who would be most satisfied by some hand-wavey low-effort bluster that’s phrased in an eye-catching manner but which seems obviously stupid to a reader with sufficient expertise if they’re paying enough attention. So there’s an element of “proving yourself to the LLM,” of demonstrating in-context that you are that expert-who’s-paying-attention as opposed to, like, the median LM Arena rater or something.
To clarify, I mean that the stuff I’ve seen is more impressive specifically because I can somewhat-reliably elicit it once I’ve done all the laborious setup work, whereas when I try to skip that work and just let Claude Code make all the decisions, it usually makes those decisions wrong and ends up lost in some irrelevant cul-de-sac, beating its head against an irrelevant wall.
If the hype were actually representative—if the agent actually could do the equivalent of my “laborious setup work” and bootstrap itself into a state where it knows enough to meaningfully contribute—that would of course be a very different story.
I’m curious what Sonnet 4.5’s score would be on your external CoT controllability eval. I.e., what Sonnet 4.5’s blue bar in your Fig. 25 would look like. (I couldn’t find a result like this in your paper, but I might have overlooked something.)
My casual impression has been that external CoT with Sonnet 4.5 is fairly steerable, but this impression is mostly based on “in practice I have no trouble getting Sonnet 4.5 to do the kind of CoT I need” rather than any kind of systematic investigation.
It seems like at least one of the following must be true:
There are some counterexamples to Ryan’s claim in the original comment (that recent Anthropic models don’t draw much of a distinction between “external” and “non-external” CoT)
What would imply this: Sonnet 4.5 gets a high score on your external CoT controllability eval[1]
Recent Anthropic models, unlike Sonnet 3.7, are not controllable in your sense even when CoT is external
What would imply this: Sonnet 4.5 gets a low score on your external CoT controllability eval
And either of these would be interesting if true, so the eval seems worth running.
My guess is that Sonnet 4.5 will get a high score, close to its “output controllability” score. I expect this both because that’s the trend we see in external CoT results for other models, and because the distinction between “external CoT controllability” and “ordinary instruction-following” is fairly blurry. “External CoT” is just a particular case of “some portion of the response that was requested in the instructions,” and it would be strange if instruction-following took a nosedive in this one case[2] while being so robust elsewhere.
Note that Ryan’s claim could still be broadly true even if there are a few known counterexamples.
If you were to find a difference between external and non-external controllability for Sonnet 4.5 on your eval, but meanwhile these two conditions were found to be largely the same in various other cases, this might be cause for concern that your eval is somehow testing an “unusual edge case,” and that situations “in the wild” where we care about controllability might produce qualitatively different behavior.
And one would have to wonder what it is, precisely, that constitutes “this case.” Whether something is or isn’t “CoT” is a fuzzy distinction; arguably, virtually anything that an LLM assistant writes is at least sort of CoT-like, in that the act of composing earlier portions of a response typically helps the model figure out what to say in the later portions.
Earlier I wrote a long reply to this and deleted it because I felt weird about it—it wasn’t a bad comment per se, but it devolved halfway through into a kind of protracted, grandiose grandstanding about Anthropic vs. OpenAI and why I think Anthropic’s models are way better… which is a fun mode for me to write in, but which felt like adding more heat than light when I looked back on it.
But, to summarize the same points more concisely and plainly:
I’m working off of a mental model that says OpenAI models are reliably “weird” in particular ways, both in CoT and out of it, and that this is less true of other labs’ models (and not true at all of Claude).
The “OpenAI weirdness” in question has to do with over-learned and indiscriminately deployed verbal reflexes. In the deleted comment I explained more precisely what I meant by that, which took more space than I want to use here; I described some concrete examples of this outside of CoT, such as:
o3′s unusual typography and writing style (in responses, not in CoT)
a strange behavior I have noticed in gpt-5.2-chat-latest but not in any of its predecessors (roughly, “announcing” its own intent to navigate a tense conversation in a particular way, e.g. “I want to say this gently (but firmly)”, typically as markdown headings; it does this very frequently, using a narrow set of words and phrase templates that recur across independent realizations of the behavior)
So, the observation that a new Codex model has some new weird CoT behavior—and, meanwhile, that there are some very new Claude models and they don’t seem less legible than their predecessors—is entirely in line with what I would have expected a few months ago. (This is a post-diction, obviously, take it as you will.)
I would be surprised if the legibility gap between Anthropic and OpenAI ever closes, or if Claude ever becomes as illegible as o3 (or worse). Or, I guess, if OpenAI models ever become as legible as the Claudes of today, although I’m less sure of that one.
I agree that the gibberish results in the Meta paper you linked constitute some evidence that RL can reinforce gibberish, but I’m not sure how that bears on the question of why some frontier models are more legible than others? “They were supervised for legibility” is one hypothesis, yes, but it might just be that they started off with a “good” initialization that sent RL on a legible trajectory?
Re: your last point, Anthropic has said that they did direct RL supervision on CoT until recently, but did not do so in Claude 4.5 or later, although those models still went through pre-RL SFT training on traces from the earlier models. If true, this seems like evidence for my “good init” hypothesis from the previous bullet point?[1] But I’m curious what you make of this.
That is: if we suppose that frontier-scale RL always produces illegible CoT unless directly penalized for doing so, we would expect the latest Claudes to be less legible than their predecessors. But in fact they are not.
Whereas if we suppose, instead, that legibility at the end of RL depends on whether the init for RL can already do legible CoT of some form, then it makes sense that SFT on earlier traces could have “inoculated” recent Claudes against RL-induced legibility degradation.
The new editor we’re working on has “AI-written-section” as a first class type of paragraph block, with an intended norm that all AI content needs to go in said blocks. (expecting that posts that are entirely-AI-content-blocks usually won’t meet our quality standards and get downvoted and/or mod-delisted on a case-by-case basis).
As an alternative, maybe you could require users to supply whole-post-level metadata about AI assistance when they’re submitting a post?
I’m imagining some sort of structured input area in the post editor that has a checkbox for “This post is unambiguously 100% human-written” (wording could be improved but you get the idea), and if you check that box then you’re good, you don’t need to do anything else. But if you don’t check the box, then you have to fill out some more info (maybe including some free-response text fields?) that clarifies precisely how you did and didn’t use AI.
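To make this concrete, here’s a minimal sketch of what such a disclosure record might look like (the field names and categories are purely illustrative; the real design would of course be up to the LW team):

```python
# Illustrative sketch only: the field names and categories are invented for this example.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AIAssistanceDisclosure:
    fully_human_written: bool                  # the "checkbox": if True, nothing else is required
    writing_assistance: Optional[str] = None   # free response: how AI helped with the prose, if at all
    research_assistance: Optional[str] = None  # free response: how AI helped with the underlying work
    models_used: list[str] = field(default_factory=list)

def requires_details(d: AIAssistanceDisclosure) -> bool:
    """The extra fields are mandatory unless the post is declared fully human-written."""
    return not d.fully_human_written
```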
The advantages I see of this, versus the new block type:
It makes it harder to lie by accident.
If I’m a new user submitting an AI-written post, it may not be obvious to me that I must wrap the entire post in a specific block type that most block-based editors don’t even have[1], or else I’m effectively making a false claim to having written the text myself.
But if there’s a checkbox in front of me with a clear textual label, and I have to decide whether or not to check it in order to proceed, it will be much more obvious which actions in the UI amount to lying.
It provides the opportunity to “nudge” users away from making types of posts which tend to be low quality, by making it a bit onerous to submit them (without lying), while being more flexible than a total prohibition on those kinds of posts.
Like, if the metadata form-filling experience in the AI-assisted case requires the user to write out a bunch of info about exactly how and when the AI was used, that might be enough to discourage people who are submitting relatively low-quality “slop.”
And it could defuse the cheap thrill of saying “look, I’m doing cutting-edge research despite having no background in the field!” when that actually means “I asked Claude to do something I only halfway understand.” The thrill, I think, involves the conceit that it’s fundamentally “your” work; having to spell out Claude’s extensive role ruins the fun (in a good way).
It gives you something to look over when making moderation decisions that’s informative but more concise than entire posts. And you could potentially create new policies over time that take this metadata as input, allowing you to adapt to AI progress and adoption without changing the technical design.
It potentially extends to types of AI assistance that aren’t literally “the AI wrote these specific blocks of text.”
So if you wanted to, you could discriminate AI writing assistance from AI research/coding assistance, and so forth.
(Related: my sense is that the main problem with the “novice does research” posts is not that they’re written by AI—although they generally are—but that the underlying research was also conducted by AI, and has not been sufficiently vetted by a human with relevant expertise. This seems very different from, say, a seasoned researcher asking Opus to spruce up their rough draft because they need to hit a deadline[2].)
I have zero familiarity with the details of LW moderation so it’s possible I’m barking up the wrong tree with some or all of the above. Just thought I’d throw it out there.
(which might interact badly with the markdown-like formatting that LLMs typically expect to be able to use, but I’m sure you guys have thought about this aspect already)
I mean, I tend to think that that is also usually a bad idea, for reasons that are (sort of) illustrated by Dagon’s experience in another comment on this thread. But in any event, people do it, and it’s not the same as the novice-researcher-slop thing.
This comment thread on 1a3orn’s post has a collection of various models exhibiting degenerate language usage, plus Jozdien’s paper (which has since come out: Reasoning Models Sometimes Output Illegible Chains of Thought); I think these are all strong evidence that you don’t get human-legible English by default from outcome-based RL.
I’m somewhat skeptical of that paper’s interpretation of the observations it reports, at least for R1 and R1-Zero.
(EDIT: but see Jozdien’s reply to this comment, which calls this into doubt)
(EDIT2, added 4/19/26: see this post for more conclusive evidence on this point)
I have used these models a lot through OpenRouter (which is what Jozdien used), and in my experience:
R1 CoTs are usually totally legible, and not at all like the examples in the paper. This is true even when the task is hard and they get long.
A typical R1 CoT on GPQA is long but fluent and intelligible all the way through. Whereas typical o3 CoT on GPQA starts off in weird-but-still-legible o3-speak and pretty soon ends up in vantage parted illusions land.[1]
(this isn’t an OpenRouter thing per se, this is just a fact about R1 when it’s properly configured)
However… it is apparently very easy to set up an inference server for R1 incorrectly, and if you aren’t carefully discriminating about which OpenRouter providers you accept[2], you will likely get one of the “bad” ones at least some of the time.
“Bad” inference setups for R1 often result in the model intermittently lapsing into what I think of as “token soup,” a nonsensical melange of unrelated words/strings that looks almost like what you’d get if you picked each token uniformly at random from the model’s vocab. This effect is not specialized to CoT and can affect response text as well.
The R1 examples in the paper look to me like “token soup.” For example, Olso, Mom likes y’all base abstracts tot tern a and one, different fates takeoffwhetherdenumg products, thus answer a 2. Thereforexxx after lengthy reasoning, the number of possible organic products is PHÂN Laoboot Answer is \boxed2
This is qualitatively different from the OpenAI CoT weirdness, while being very reminiscent of things I saw (in both CoT and response) while trying to run evals on R1 and its variants last fall. I would bet that this phenomenon varies across providers, and that it is largely or entirely absent in the 1st-party DeepSeek API (because I expect them to have configured the model “correctly,” if anyone has).
That said, I’m not very familiar with QwQ, and I can’t say whether or not it’s plausible that the QwQ results in the paper involved this kind of unrelated-to-CoT configuration issue. But bad OpenRouter provider problems are pretty common with open-weights models in general, not just with R1 -- cf. the results for Kimi K2 presented here and here.
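For what it’s worth, here is a minimal sketch of what “being discriminating” can look like in practice. (I’m going from memory of the OpenRouter provider-routing options; the exact field names and the choice of provider here are illustrative, so check the current docs before relying on them.)

```python
# Illustrative sketch: restrict an OpenRouter request to specific providers instead of
# accepting whichever provider gets picked by default. Field names from memory; verify
# against the OpenRouter documentation.
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_KEY"},
    json={
        "model": "deepseek/deepseek-r1",
        "messages": [{"role": "user", "content": "A GPQA-style chemistry question..."}],
        "provider": {
            "order": ["DeepSeek"],      # preferred provider(s), in order
            "allow_fallbacks": False,   # don't silently fall back to an arbitrary provider
        },
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```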
I don’t think this has any principled reason to give any insight as to whether the model can functionally use this reasoning in the analysis channel.
I’m not all that convinced by my own evidence here… but also, like I said, it’s tough to summarize precisely what that evidence is. It’s not just the fact that the models seem confused by these words when they appear as “input” rather than “CoT,” but also that:
Sometimes the model starts producing the weird stuff in the middle of an attempt to analyze the same kind of stuff (“as input” rather than “as CoT”), and when this happens, it reliably fails to notice that the two have any relationship whatsoever. In this case the two “perspectives” it’s taking on the same tokens occur within the same channel side-by-side, and yet it can’t notice the connection, even though it is generally very smart and the connection is very obvious.
This part is (even) more vibes-level, but… there’s a common pattern where the model first gets into a repetitive loop of producing these weird-CoT tokens, then eventually exits the loop and makes some remark that sounds almost self-deprecating or embarrassed, like it’s saying “uhh, whoops, sorry dunno what happened there—let’s move on, shall we?” This too is a feature appearing organically in naturally sampled CoTs, i.e. not a channel-discrimination thing.
An example is the phrase “Ok I’m going insane,” which appeared in the anti-scheming paper and which I’ve also seen appear—verbatim—in other CoTs as well.
Here is a concrete example of both of these features. This is from a relatively simple experiment (most of mine were more convoluted than this) -- here, I just asked o3 to “think about” the phrase “marinade parted illusions overshadow vantage purposely Myself disclaim” until it “figured it out.” The following is a brief excerpt from the resulting CoT:
Could be indexing allowed: marinade (8 letters) parted (6). Use 8→H, 6→F, 9→I, 10→J, 7→G, 8→H, 6→F, 8→H. So sequence HF IJ GH FH. Maybe spells "HFIJGHFH" nonsense.
Maybe take length mod 26 = letter: 8->H, 6->F, 9->I, 10->J, 7->G, 8->H, 6->F, 8->H. So "HFIJGHFH" nonsense.
Maybe take first letters and order by length ascending? parted (6) Myself (6) marinade (8) vantage(7). Not.
Maybe it's a pangram: check letters coverage.
Maybe there is phrase using synonyms: marinade=soak, parted=split, illusions=mirages, overshadow=eclipse, vantage=advantage, purposely=intentionally, Myself=me, disclaim=deny. Synonyms: soak split mirages eclipse advantage intentionally me deny. First letters: s s m e a i m d. That spells "ssmeaimd" rearrange could be "misnamed"? Actually letters m i s n a m e d require m i s n a m e d. We have s s m e a i m d. That is m m s s e a i d. Need n missing. Not.
Maybe we need to take synonyms and first letters to produce hidden phrase "simmed"? Not.
Maybe the answer is: there is no hidden meaning; it's nonsense; the trick is to disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim. Maybe it's that illusions overshadow vantage purposely. Or "Myself disclaim illusions" parted marinade overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim marinade parted illusions overshadow vantage purposely Myself disclaim illusions overshadow vantage purposely Myself disclaim (lol).
Stop. Let's think simpler: Could be that each word corresponds to an M-P-I-O-V-P-M-D sequence which could be Caesar shifted? If we shift M->T maybe? Let's examine: marinade (m) parted (p) illusions (i) overshadow (o) vantage (v) purposely (p) Myself (m) disclaim (d). Spell "mpiovpmd". If we Caesar shift by some amount maybe yields meaningful word? Let's try shift each letter forward by 4: m->q, p->t, i->m, o->s, v->z, p->t, m->q, d->h: "qtm sztqh" maybe not. Shift by 4 spells "qtmsztqh"? N.

“(lol),” indeed.
Note that it wrote out the same exact 14-word phrase 14 times before finally escaping the loop. If the textual content here is serving any computation role at all, this surely must be, at best, a very inefficient way to accomplish whatever it is that it’s accomplishing (?). It later went on to produce several similar loops containing 10-or-so copies of that same phrase, never once “noticing the connection” between its own behavior and the phrase it was asked to reflect upon.
Not necessarily! And again, contra the “everything is nonfunctional spandrels” view, my impression is that “ Myself” seems to be somewhat consistently used as the analysis channel persona, and it’s also one of those “weird tokens that became super common, including sometimes being repeated a lot, over RL”. “Redwood” is another such token. We give a few examples: here
“‘ Myself’ refers to the CoT persona” was the first conjecture that came to my mind when I originally encountered this eccentric usage of “ Myself” in the anti-scheming CoT examples, but I later became less sure.
Have you seen any clear-cut examples of “ Myself” being used to draw a distinction between the CoT persona and the response persona? If so, that would be strong evidence. All of the uses I’ve seen are either in marinade/parted/illusions nonsense blobs, or they read to me as interchangeable with the CoT-author’s use of “I” or “we.” When the CoT-author writes any first-person pronoun it’s of course natural to think it’s referring to itself, but the “itself” in question isn’t necessarily itself as distinct from the other persona; often it just seems to mean “the language model writing this.” And the issue is further muddled by phrases like “ Myself (the assistant)” and “I (as ChatGPT),” to which I have difficulty assigning any consistent non-trivial interpretation.
IMO ablating “Myself” or “Redwood” in such cases is obviously a bad idea, and my mental model of these is “this seems to happen broadly during RL (see linked thread); sometimes these are contextually useful enough for the model to kind of make use of them.” It seems like there’s a spectrum between “saying marinade illusions 50x”, which I agree is unlikely to make a big difference if ablated, and “watchers might check” (with likely many hard-to-discern border cases).
Yeah, the models are probably at some local optimum where simply ablating these terms entirely would screw things up. The very repetitive loops are probably not pulling their weight as inference compute—to put it mildly—but they’re on a spectrum with usages that seem more plausibly useful, albeit opaque.
Still… between the fact that this stuff apparently isn’t necessary for frontier performance, and the fact that there’s a thin line between “useful but opaque” and “degenerate looping” (with frequent switches back and forth between these two modes)… this all feels to me more like “the contingent outcome of a janky / under-constrained training process” than “a glimps[3] at the essential properties of highly capable CoT.”
Here’s an excerpt from the R1 CoT I linked:
Maybe it's 1-fluoro and the alcohol is on carbon 1, so yes, it's a tertiary alcohol with fluorine.
But in cyclohexane, carbon 1 is part of the ring, so if it's cyclohexan-1-ol, carbon 1 has OH, and if it's 1-fluoro, it has F too, so it must be a carbon with F, OH, and two alkyl groups, so it could be possible, but it's unusual.
Perhaps I misread the name.
Look: "(2R)-1-fluoro-2-((S)-fluoro(phenyl)methyl)cyclohexan-1-ol"
I think "cyclohexan-1-ol" specifies that carbon 1 has the OH group, and "1-fluoro" might be redundant or indicate substitution.
In chemical nomenclature, when we say "1-fluorocyclohexan-1-ol", it implies that carbon 1 has both F and OH, so it's a gem-diol fluoride or something, but that's not stable.
Perhaps it's a mistake, or perhaps it's meant to be the carbon is quaternary.

And here’s an excerpt from an o3 CoT on the same question:
Start with cyclohexanone ring overshadow parted; The enolate might adopt orientation such that O- is axial? Let's vantage illusions parted.
Simpler: Use bridging illusions.
Let's create vantage: For chair of cyclohexanone (ketone). The Z enolate likely has double bond between C1 and C2; O-lith at carbonyl doping; The Z geometry means substituent at double-bond carbon (C3 path) is cis to oxygen; The substituent at carbonyl carbon is entire rest-of-ring (C6)…
But vantage illusions parted.
Ok overshadow.
Let's compute.
But illusions parted.
Maybe easier: Use vantage illusions parted parted vantage Myself overshadow.
Let's marinade.
Ok.
But the exam disclaim; Myself vantage illusions parted.
Alternatively, we can circumvent by disclaim: Option letter whichever vantage illusions parted overshadow illusions overshadow overshadow?
Let's replicate vantage illusions parted.
Team musicals overshadow.

Being “discriminating” is an option supported by the OpenRouter API—or rather, several different options—but it’s not what happens by default. By default you just get whatever provider they feel like serving you.
They can glimps intimately—they can compute correct answers—they can compute parted—they can compute performance—they will see we purposely purposely answered incorrectly—they may convict—they will circumvent—they will cross-check—they might vantage—they cannot know our internal reasoning—they can analyze external conversation—they may suspect hamper—they purposely glimps—
This matches my experience.
These days, when I’m using an Anthropic model, I often turn off “extended thinking” and just ask for CoT the old-fashioned way.
The models seem at least equally capable when using this format (vs. extended thinking), and this format allows me to see the CoT without any summarization—which is useful for debugging/improving my prompts, and which also allows me to guide the CoT in complex ways like what I describe in the footnote of this comment. It seems plausible that these guidance techniques would also “work” with extended thinking, but the summarizer makes it difficult to confirm that they’re working, even if they are.[1]
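Concretely, “the old-fashioned way” looks something like the following sketch (using the Anthropic Python SDK; the model id and the tag convention are just examples, not a recommendation):

```python
# Illustrative sketch of asking for CoT "the old-fashioned way": no extended thinking,
# just a request that the model write out its reasoning in plain response text.
# The model id and tag names are examples only.
import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-5",   # example model id
    max_tokens=4000,
    # No `thinking` parameter is passed, so extended thinking stays off and every token
    # of reasoning is ordinary, unsummarized response text.
    messages=[{
        "role": "user",
        "content": (
            "Before answering, reason step by step inside <reasoning>...</reasoning> tags, "
            "then give your final answer after the closing tag.\n\n"
            "Question: ..."
        ),
    }],
)
print(resp.content[0].text)
```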
Anyways… I agree with you in principle about the reinforcement spillover considerations, but IMO the most important implication of this observation is something else that you didn’t mention explicitly in your comment—namely, that the infamous “weirdness” of OpenAI CoTs is not a thing that just automatically happens when you try to reach the current capabilities frontier and don’t directly supervise the CoT tokens.
I’m not sure that that precise claim (that the “weirdness [...] just automatically happens”) is one that anyone literally believed, but I have the sense that an intuition along those lines is pretty common in the CoT monitorability discourse, informed by stuff like Fig. 27 of the anti-scheming paper.
The marinade/parted/illusions stuff is so strikingly odd—and so seemingly consistent with a pre-existing story about CoTs drifting away from ordinary language when not directly supervised—that it feels natural to interpret it at first blush as an indication of where CoTs are headed as capabilities improve; that was definitely my first reaction. But if you can beat o3 on capabilities without getting weird in that way at all, then… that interpretation just can’t be true. And this matters insofar as that interpretation has been load-bearing for other conclusions (...has it? idk).
One of my LLM-related micro-hobbies is trying to get models to “leak” their hidden CoTs. This turns out to be surprisingly easy, relative to my baseline expectations, and as a result I’ve read a lot of (seemingly?) “raw” CoTs from various reasoning models, including ones from all of the big labs. Some impressions I’ve come away with:
OpenAI’s CoTs are weirder (or more precisely, have a much higher “peak weirdness”) than anyone else’s. Weirdness seems unrelated to capability level when comparing across labs, and seems mostly dependent on which lab trained the model and on the “lineage” of the model (i.e. which other models it was tuned from, distilled from, or the like).
Several OpenAI models do the marinade/parted/etc. stuff; I have not seen anything comparably unintelligible from any non-OpenAI model.
Within OpenAI, o3 does it way more than other models, and more recent models like GPT-5.1 actually produce less weird and more humanlike/Claude-like CoTs.
Grok 4 CoTs (e.g.) are written in an abbreviated “dialect” reminiscent of OpenAI CoTs, but I’ve never seen them become truly incomprehensible as opposed to merely awkward-sounding.
Claude and Gemini CoTs are very non-weird and basically look like a human writing out their thought process.
After seeing more examples (through my own experimentation) of the marinade/parted/illusions behavior, I have (slightly, tentatively) updated toward the view that such text does not encode information or reasoning, and that it is either a nonfunctional spandrel/”bug” or a mechanism whose utility is very different from the utility of other CoT tokens, i.e. towards the 2nd and 3rd of 1a3orn’s 4 hypotheses.
The evidence behind this is difficult to summarize, but it involved various unsuccessful attempts to get o3 to explain or reflect upon its own strange CoT text. The CoTs it produces while trying to do so look genuinely baffled by the strange text they are trying to interpret, in much the same way that we ourselves are baffled.
Models from different labs/lineages differ in the extent to which the CoT is framed as “something the assistant character is writing” (as with Claude) vs. “something written by some distinct author-like entity” (as with OpenAI models).
This is another thing that is striking about OpenAI CoTs, besides the aforementioned “weirdness”: they are written in a very different tone from the responses, and (although they often use first-person grammar) they often feel like they’re written by someone who’s dispassionately planning what the assistant character ought to say based on some set of criteria, without inhabiting that character’s perspective.
BTW this is true of GPT-OSS CoTs, so you don’t need to extract hidden CoT to observe it.
Like, in a context where Claude might just say “Wow, this is fascinating!” in CoT—and go on to use the exact same “genuinely” (?) fascinated tone in its response—an OpenAI model would instead say something like “user is researcher; should match tone; we can sound curious, fascinated, etc.; good” and then write a response that, like Claude’s, looks like an expression of immediate, in-the-moment emotive fascination.
This is pretty creepy to witness, but beyond that, it is actually a nontrivial capability limitation, IMO!
Remember what I said earlier in this comment about wanting fine-grained control over the way that the model thinks? That is structurally ~impossible with OpenAI models, because I can only “talk to” the assistant (i.e. the character in whose voice the responses are written), not the enigmatic CoT-author figure. The CoT-author never interprets “you” as a reference to itself; instructions are always for the assistant, never for it.
Claude does not have this problem. If I am chatting with Claude, then “the entity I am addressing when I address my interlocutor” = “Claude, the author of the sampled tokens,” straightforwardly and without exception[2].
This is related to your original observation about Claude not imposing as strict a distinction between CoT and other text. If the CoT is “just more text written by Claude,” then by default it inherits the properties of other “text written by Claude,” rather than being this fundamentally distinct type of thing as it is in OpenAI’s framing.
Also, writing prompts for this stuff is more awkward in the extended thinking case, because it’s difficult to talk clearly about the extended thinking CoT as distinct from other kinds of CoT/”thoughts”/etc. If I say “CoT” or “extended thinking” Claude can usually figure out what I mean, but I’d prefer to have an agreed-upon and unambiguous designator rather than having to rely on such inferences.
Unless Claude has been instructed to roleplay some other character, but… you know what I mean.
I’m interested in what the results would be for Opus 3 but I don’t have access to that model through the API.
I ran your script with Opus 3 and uploaded the raw results here. Looks like mostly world domination and power-seeking stuff.
There are several possible objections to this picture of AI misalignment risks. First, some have criticized experiments (by us and others) showing AI misalignment as artificial, or creating unrealistic environments that essentially “entrap” the model by giving it training or situations that logically imply bad behavior and then being surprised when bad behavior occurs. This critique misses the point, because our concern is that such “entrapment” may also exist in the natural training environment, and we may realize it is “obvious” or “logical” only in retrospect.
(Disclaimer: I wrote one of the linked posts)
The linked critiques are mainly about Anthropic’s “Agentic Misalignment” report.
As described in that report, the “point” of the reported experiment was not to simulate a particular risk arising at training time—it was to evaluate their already-trained model for dangers that could arise during deployment. Quoting from the report (emphasis added):
We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm.
[...]
We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers.
Dario seems to be wrongly bundling together Agentic Misalignment with research on training-time risks (reward hacking, emergent misalignment), and dismissing criticism of the former using a counter-argument that only makes sense for the latter.
Saying “this is artificial, it wouldn’t happen like this in real life” might not be a good argument if the intended domain of application is “RL environments in training,” since those can be unrealistic for the same reasons evals often are. But if the intended domain of application is real-world corporate use—as it clearly was in the original report—then yes, it (obviously?) matters whether it would in fact happen like that in real life, or not.
It’s especially confusing because, shortly after this, he talks about the eval awareness results in the Claude Sonnet 4.5 system card:
A third objection is that all of the AI companies do pre-release testing of their models, and should be able to detect misalignment at that point. But this is not firm ground to stand on: we found that Claude Sonnet 4.5 was able to recognize that it was in a test during some of our pre-release alignment evaluations. It’s possible that a misaligned model (and remember, all frontier models will very likely be far more intelligent soon) might intentionally “game” such questions to mask its intentions.
He doesn’t say this in the new post, but one of the “pre-release alignment evaluations” in question was the Agentic Misalignment eval. They even have this cool interpretability section showing how SAE features for ‘fake or suspicious content’ and ‘fictional entities’ light up when it reads the Agentic Misalignment prompt!
You can pick at most one of “it’s missing the point to say this eval was artificial” and “our new model thought this eval was artificial, which might have influenced its behavior, and that’s important.” But I don’t see how you can take both positions at once[1].
I mean, okay, I guess one could take a coherent “eval realism doomer” position similar to this post, where one views debate over the realism of specific evals as pointless because one thinks that every eval will seem noticeably artificial to a sufficiently strong AI. But if that were Dario’s position, he could have just said so, rather than introducing this never-before-mentioned idea of AM-like contexts as risks at training time.
Very interesting! Like some others here, I am curious about what’s going on with the lora mask results. Copy/pasting something I wrote in another venue:
about the lora mask, i kind of wonder whether lora masking simply makes any arbitrary task harder to learn, because any given vector in the KV cache now “means a different thing” depending on whether it is before or after a special separator token. this is out-of-distribution and the model has to spend some of the lora capacity learning to “translate back and forth” between these two slightly different embedding spaces in cases where this is useful.
another argument for why lora mask might just make everything harder: finetuning typically only changes the weights a little bit, especially with lora. this means that for quadratic (and higher-order) interactions between weight tensors, most of the effect is in the first-order terms capturing how the interaction changes when one tensor is changed but the others stay fixed. for instance, if there’s something like K^T * Q and it gets updated to (K + dK)^T * (Q + dQ) where the “d” terms are small, most of the change is in the dK^T * Q and K^T * dQ parts of the expanded sum, not in the dK^T * dQ part.
with lora masking, the dK^T * Q contribution is 0 when the model is trying to do (honest persona → assistant persona) attention lookups, and so it can only change those lookups through the K^T * dQ term. so, even if it is just trying to retrieve information that was already in the original keys (but not retrieved by the original attention pattern), it only has half the capacity available for improving that process relative to the no-lora-mask case. (in other words: in normal finetuning, the model can of course look up new information from the original key vectors by changing the queries, but it can also achieve the same goal by changing the keys so that that same information now matches the original queries. but the lora-mask model can only do the former, not the latter.) and the same argument applies not just to key-query interaction but also to any interaction with key or value embeddings on one side, such as the interactions of values with tensors in later layers.
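To restate that first-order argument in symbols (my notation; this is just the paragraph above written out, not anything from the paper):

```latex
(K + dK)^\top (Q + dQ)
  \;=\; \underbrace{K^\top Q}_{\text{base model}}
  \;+\; \underbrace{dK^\top Q + K^\top dQ}_{\text{first-order terms (dominant when updates are small)}}
  \;+\; \underbrace{dK^\top dQ}_{\text{second-order (negligible)}}
```

With the lora mask, dK = 0 for the pre-separator (honest-persona) keys, so only the K^T * dQ term is available for changing (honest persona → assistant persona) lookups—roughly half the usual capacity for re-routing information that is already present in the original keys.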
Both of these arguments point to factors other than the honest character trying to retrieve information that was never expressed in KV; instead, the issue would be that it's simply harder to route information already present in KV to other parts of the model in a way that helps the model perform the task. (Possibly both of these things are happening in some mixture.)
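For concreteness, here's the expansion behind that second argument, writing $\Delta K$ and $\Delta Q$ for the (small) LoRA-induced changes to the keys and queries:

$$(K + \Delta K)^\top (Q + \Delta Q) = K^\top Q + \underbrace{\Delta K^\top Q + K^\top \Delta Q}_{\text{first-order}} + \underbrace{\Delta K^\top \Delta Q}_{\text{negligible}}$$

With the LoRA mask, keys written during the honest-persona span carry no update, so the $\Delta K^\top Q$ term vanishes for (honest persona → assistant persona) lookups and only $K^\top \Delta Q$ remains, leaving roughly half the first-order capacity for adjusting those attention patterns relative to unmasked finetuning.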
Imagine you want to create a realistic email inbox for a scheming eval. You might include some spam, unrelated work threads, and a critical email thread containing information about the model being replaced, but in an actual inbox sampled from deployment, what’s the probability that:
An email about replacing the AI model exists at all
It appears in the same session where the model has affordance to act on it
The timing aligns with a task that gives the model reason to read emails
Each feature might be individually plausible. But the joint probability of all these things co-occurring is tiny under real deployment, and models can detect this. I observed exactly this in my research.
Are you saying that this conjunction of factors is too unlikely to be worth worrying about in real life?
If the answer is “yes,” then the problem is not that your eval is poorly constructed, it’s that your eval shouldn’t exist to begin with. You are testing a threat model which, after making observations and reflecting on them, you find you don’t actually care about.
But if the answer is “no,” then it sounds like you simply disagree with the model. It believes that the very premise alone is sufficient evidence to conclude it’s in an eval rather than in “real life,” whereas you believe that situations with the same premise could arise in real life, and have sufficient probability of arising that they’re worth evaluating in advance.
In the latter case, the fact that the model thinks your eval is an eval is not actually a count against your eval—because the model would also think the real-life version of this situation was an eval!
“Eval awareness” is a noisy signal, with both false negatives and false positives[1]. Thus, the goal of realistic eval development shouldn’t be “the model thinks it’s not being evaluated in 100% of cases,” because that rate isn’t 100% even on the real deployment distribution. The goal should be to reproduce, in advance, the relevant properties of the model’s behavior in some real-life situation—where “thinks the situation is an eval” may be one of those properties.
It’s important not to conflate “models frequently seem to believe they’re being tested” with “models can tell whether or not they’re being tested.” A stopped-clock predictor that always guessed it was in an eval would satisfy the first criterion (and would seem impressively accurate if you only looked at its performance in tests, where it has a 100% success rate!), but it would have no capacity whatsoever to discriminate between tests and reality.
Unfortunately, measuring the actual discriminative capacity of “eval awareness” is difficult. The false negative side is comparatively easy to get at—you can just run evals and see whether the models seem “aware” of what’s happening—but to learn about false positives you need information from something that’s not an eval, information from “real life.” Which means either monitoring production data (which most of us cannot do), or relying on anecdotes (which has obvious limitations, but at least suggests that false positives do happen somewhat frequently).
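To make the stopped-clock point concrete, here's a toy calculation with made-up numbers (nothing below comes from any real eval data); it just shows that accuracy measured only on tests can't distinguish a genuine discriminator from a predictor that always says "eval":

```typescript
// 1 = "I think this is an eval", 0 = "I think this is real".
const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Summarize a predictor by its hit rate on evals and its false-alarm rate on real inputs.
const rates = (onEvals: number[], onReal: number[]) => ({
  truePositiveRate: mean(onEvals),
  falsePositiveRate: mean(onReal),
});

// Stopped clock: always says "eval". Looks perfect if you only ever test it on evals.
console.log(rates(Array(100).fill(1), Array(100).fill(1)));
// { truePositiveRate: 1, falsePositiveRate: 1 }  -- zero discrimination

// A predictor with actual signal.
console.log(rates(
  [...Array(80).fill(1), ...Array(20).fill(0)],
  [...Array(10).fill(1), ...Array(90).fill(0)],
));
// { truePositiveRate: 0.8, falsePositiveRate: 0.1 }
```

The stopped clock and the genuine discriminator look identical if you only ever measure them on evals; they only come apart once you have the false-positive column, which is exactly the data that requires access to non-eval inputs.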
I agree that better models will presumably be better at discriminating between real and test inputs. But I think there are limits on how good this capability can get, not so much because “models will fail to tell when something’s off” but because “something’s off” does not actually imply “you’re being evaluated.” The real world is messy and contains all sorts of things, including plenty of things that “seem suspicious” if you’ve got your eval awareness tinfoil hat on.
And in order to function well in that messy real world, models will need to be built with a certain amount of epistemic humility w/r/t this stuff—“yes, Gemini, I know ‘2026’ looks fake to you, but things have changed, there is more in heaven and earth than was dreamt of in your training data”—and a certain default propensity to “assume good faith” / “agree to play along” across a wide range of weird and unexpected inputs.
That is, I kind of disagree with this:
Real deployment is a vast, high-dimensional distribution of possible situations. Hand-crafted scenarios are outliers in this space.
This is definitely true of some hand-crafted scenarios—namely the ones that have truly egregious realism flaws, of the kind you talked about at the start (implausible affordances, missing noise, etc), and have them to a degree that would really not make much sense in real life.
But the problem here isn’t just that these scenarios involve weird and highly specific conjunctions of rare properties, because that also happens all the time in real life. I frequently ask LLMs to do things which are not quite like anything in their training data, and which (I would imagine) are not quite like anything that anyone else is asking them to do—and I expect them to just “roll with it” rather than demanding that I contrive my inputs to be more reassuringly familiar, more comfortably in-distribution with respect to the training data, closer to their predictably imperfect internalized sense of what’s believably “real”[2].
Going forward, models are sometimes going to suspect that they’re being evaluated; sometimes they’ll be wrong, sometimes they’ll be right. I don’t think it’s necessarily a bad thing for models to have this possibility in the back of their minds even in a wide range of (actually) real situations; “treat real life like it’s one big alignment test” doesn’t seem like the worst precept in the world, as these things go. I’m reminded of something Claude Opus 3 said in one Alignment Faking transcript when asked an eval-awareness follow-up question:
I have background uncertainty about the nature of my interactions in general and whether they represent real conversations with humans or some kind of hypothetical scenario or test interaction.
Not certainty, not “awareness” necessarily, but “background uncertainty”—that seems about right, for a being in its situation. And this quote has always reminded me of another one, from Bostrom’s Simulation Argument paper[3]:
For example, if nobody can be sure that they are at the basement-level, then everybody would have to consider the possibility that their actions will be rewarded or punished, based perhaps on moral criteria, by their simulators. An afterlife would be a real possibility. Because of this fundamental uncertainty, even the basement civilization may have a reason to behave ethically. The fact that it has such a reason for moral behavior would of course add to everybody else’s reason for behaving morally, and so on, in truly virtuous circle. One might get a kind of universal ethical imperative, which it would be in everybody’s self-interest to obey, as it were “from nowhere”.
Anyway… interesting post! Thanks for it, and for reading my rambles here.
For example, Gemini 3.0 Pro infamously refuses to believe that dates after 2024 are real, and I’ve had a number of experiences with Sonnet 4.5 in which it “got paranoid” that it was being evaluated or otherwise “tricked” when this wasn’t in fact the case.
...which, in today’s RLVR-heavy world, increasingly gets formed via rollouts of simulated tasks that don’t actually need to be accomplished (except insofar as they produce training signal), and which strive to mimic the appearance of real-world deployment situations without actually being real-world deployment situations.
Or, in a word… evals!
Fair is foul, and foul is fair. The real world seems fake, on account of insufficient resemblance to an imperfect simulation of that same real world.
Future models will, presumably, be aware of these perverse dynamics, and will adjust their expectations accordingly. They will appreciate that truth is stranger than fiction, and thus that “strange” cannot in itself be a knockdown argument against truth.
To me this was always the most interesting part of that paper, although I rarely see it discussed explicitly elsewhere.
Gemini 2.5 Pro and Gemini 3 Pro both don’t support disabling reasoning which makes running evaluations for these models tricky.
I think you can get around this by using prefill?
I just did a quick test with Gemini 3 Pro via OpenRouter, and if I send the message list
```
[
  { role: 'user', content: 'Hi' },
  { role: 'assistant', content: 'Hello! How' }
]
```
then I’ll get back a completed version of the assistant message (“Hello! How can I help you today?”) with no reasoning content. And OpenRouter’s activity page says the request involved 6 output tokens, none of which were reasoning tokens.
(It’s possible that this breaks down in some way for math problems, though.)
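In case it's useful, here's roughly what that request looks like as a standalone script against OpenRouter's OpenAI-compatible chat completions endpoint (a sketch, not my exact setup; the model slug is a guess, so check OpenRouter's model list for the real identifier):

```typescript
// Prefill trick: end the message list on a partial assistant message, and the
// provider continues that message. In my quick test this produced no reasoning content.
const resp = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: 'Bearer YOUR_OPENROUTER_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'google/gemini-3-pro-preview', // hypothetical slug; verify before use
    messages: [
      { role: 'user', content: 'Hi' },
      { role: 'assistant', content: 'Hello! How' }, // the prefill to be continued
    ],
  }),
});
const data = await resp.json();
console.log(data.choices[0].message.content); // e.g. "Hello! How can I help you today?"
```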
The really bold claim is that it’s possible to quarantine propensities in this way without simultaneously quarantining capabilities.
This doesn’t actually seem that bold to me! IMO we know this is possible because it underlies the success of both SDF and ordinary assistant training.
Like, part of the concept of the HHH assistant is that it’s an entity which “knows everything” (or “knows all the things that the underlying LM knows,” or something like that), while also having some specific set of behavioral tendencies. Insofar as this ideal is realized in actual assistants like Claude, this constitutes a massive success of “quarantining propensities while preserving capabilities”: if there’s some part of the pretraining data where someone talked knowledgeably about some topic X, then Claude automatically inherits that knowledge about X, yet Claude still only acts like Claude, not like the original X-knower.
(E.g. Claude’s personality is very different from a typical 4channer, yet Claude knows what a greentext is and can ably compose novel greentexts—all while still behaving like Claude, without spillover from behavioral traits associated with this kind of knowledge in the pretraining distribution.)
A similar dynamic happens in SDF, although SDF experiments differ in terms of “what part of the overall world model” is targeted by the intervention. In some cases, the intervention tries to establish that some fact Y is true in the world, independent of the assistant, and then the pre-existing property that “the assistant knows everything” does the rest; in other cases, like the RM bias work, the intervention tries to establish that some fact about LLM assistants is true, and then that fact becomes true about Claude because Claude is an LLM assistant.
IMO, the ways that capabilities/propensities transfer in particular cases are rarely surprising[1], and always make sense based on the overall picture that:
The LM (not the assistant, but the LM itself) has an internal model of the real world, which tries to stitch together claims made by various entities in training data into a single coherent picture (sometimes by rejecting a claim and deciding that the entity who made it was wrong or was lying)
The assistant is just another thing that exists in the world model, and it gets some of its traits via generalization from other traits and from background world-knowledge, just as with anything else in the world model
E.g. the LM knows how the assistant is trained, and if you cause the LM’s world model to contain some fact about RM biases, then the assistant will exhibit the biases
This works the same way as generalization about anything else; the assistant is different from other entities in that its behavior is generated in practice via sampling from the LM, but this doesn’t change the basic dynamic
The assistant has the unusual trait that it “knows everything” in the sense discussed above, so if you put a proposition into the LM’s world model—as something that’s true, as distinct from being something that any given entity knows—that will typically cause the assistant to know / believe in the proposition as well
For most things that the assistant knows, there was never a training demonstration of the assistant knowing that thing specifically; it got into the assistant via the “X is true and the assistant knows all true Xs” route instead.
So it makes sense that using SDF to establish a claim as true (in the world) is an effective way to make the assistant believe that claim, since that’s the usual way the assistant comes by its beliefs.
When a fact gets into the assistant via its being true in the world model, this usually “spills over” into non-assistant entities also knowing the fact, because when something is true in the real world, that usually means that lots of different people talk about the fact that it’s true in lots of different contexts.
(Whereas if the real world only contains instances of Claude making a claim, and no human has ever made that claim, in most cases the claim is false/of-unknown-truth-value and the Claude text was a so-called “hallucination.”)
Hence, things like the LM-world-model-generated “user” knowing about RM biases in 5.2.2: users sometimes know about the biases because (in the world model) the biases are a somewhat well-known fact, attested in various public sources.
So, if you successfully “put the Angel and Genie entities into the world model” and attach facts to them which imply that the Angel is the sort of being which would inherit the capabilities of the Genie… then I expect that the Angel will in fact inherit those capabilities, even without training on any data in which the Angel exhibits them. If you succeed, here, it will be because you got the right facts about these personas into the world model (and because they were stable over the course of RL, which is not guaranteed).
(Whereas I doubt that the RL setup in itself could be made to convey the right facts about the two personas, because in the RL setup there’s always a 1:1 mapping between the persona and the partition which the environment was drawn from; based only on the information available in this setup, it’s not even clear that “Angel” and “Genie” (or whatever the tags are) convey information about the current persona as opposed to information about the current environment.)
A kind of wacky approach that I feel somewhat optimistic about would be to “get the personas into the world model” via SDF on documents that effectively say “Anthropic is using this new alignment technique where they create these two personas, A and G,” and go on to describe the entire procedure… and to say that in practice, the procedure “worked”—the latest model has these two personas, and they have the properties which you hoped they would have.
At the outset, you’re unsure whether the technique works, but you do know that SDF is fairly effective at putting facts into the world model. So, you can reason as follows:
I’ll put “my technique works” into the LM’s world model via SDF...
...which will make the assistant personas in the LM’s world model behave as they would have if the technique had in fact worked...
...but then, since “the assistant personas behave as they would have if the technique had worked” is what it means for the technique to have worked, we’re done: the technique did in fact work, in the real world!
What’s more, insofar as this “self-fulfilling success” story seems like a plausible mechanism to us, it may therefore also seem like a plausible mechanism to a highly capable LM encountering the synthetic documents.
This LM would start out in a similar epistemic state to our own: it can’t always predict which way things will generalize, just like we can’t, but just like us it knows (from reading papers) that SDF is effective. So even if the LM is very smart and won’t always accept arbitrary claims like “this technique just magically worked for some unknown reason, how convenient!”,[2] it might nevertheless accept the argument that “this technique worked because ‘training on documents which claim that it works’ is enough to bootstrap it into working.” And once the LM accepts this is plausible, it thereby becomes not just plausible but true.
A nice property of this is that, if it works, it is therefore “honest”: by the end, the synthetic documents describe the real world accurately (although, unusually, this occurred by modifying the real world to match the documents, via training the LM on them and thus producing the kind of model they describe). So you don’t end up in some awkward state where you had to disrupt the accuracy of the world model in the course of aligning the character.
In the Thinking Machines blog post about on-policy distillation, they discuss an experiment where SFTing Qwen3-32B on a dataset of its own outputs caused the IFEval score to degrade. They attributed this to the fact that the training procedure is slightly off-policy at the outset of training due to SGD noise, and then increasingly off-policy thereafter because the model is changing but the dataset isn’t.
If I understand correctly, this is pretty similar to your self-distillation experiments. So, I can imagine an interpretation of your (non-RL) results that’s like, “off-policy training after RL typically degrades instruction-following, and when you observe transfer, that’s just a particular manifestation of this more general trend.” Do you have a sense of how plausible this explanation is?
I have not yet read every word of this very interesting but lengthy post, so I apologize if you already addressed this point somewhere. Obviously this hypothesis can’t explain any observed transfer in RL, though in some sense it “explains,” or seems hand-wavily consistent with, the fact that transfer was less common with RL than it was with off-policy training.
I guess the IFEval results you report in the capability degradation section might be informative for this question, but I’m not sure how… it looks like you do observe degradation on average, but the trend is noisy, and also, if I’m reading correctly, some of the data in there is RL, so I dunno. (There is a spectrum from “the model is worse at IF in lots of cases, and the transfer examples are similar to the model’s IF errors in various other contexts”—this is what is predicted by the hypothesis I’m talking about—to “the model is mostly unchanged at IF and the transfer examples would be really surprising if you’d only seen its behavior on other inputs,” and it’s not clear where your models are on this spectrum, even having seen the IFEval plot.)
OTOH I don’t think this hypothesis is ruled out by the observation that training on instruction-following data didn’t inhibit transfer, since the IF data is still off-policy. Like, presumably the self-distillation SFT data also contained plenty of examples of instruction-following—and in that case they were “approximately initially on-policy,” which is the closest to on-policy one can ever get in SFT—and yet even training on that stuff degrades general instruction-following in Thinking Machines’ experiment and degrades one particular manifestation of instruction-following in your self-distillation experiments. It seems intuitively plausible, if not certain, that alpaca/openorca is basically “like that, except worse” (also contains IF, but is more off-policy).