I write original fiction.
Also I have opinions about AI and stuff, sometimes.
Elsewhere:
Same person as nostalgebraist2point0, but now I have my account back.
I have signed no contracts or agreements whose existence I cannot mention.
I also noticed a regression relative to Opus 4.7 on a small set of writing samples (most of them written by me) which I had used to test Opus 4.7′s truesight, using Kelsey Piper’s prompt.
I’ve been thinking about this again recently, because I’ve been using coding assistants a lot—and specifically the latest and greatest ones, GPT-5.5 and Opus 4.7.
And they are indeed very impressive. It almost feels like I’m working with a tireless, deeply experienced software engineer who’s always down to pursue whatever avenue I feel like pursuing at the moment.
Almost.
But there is one huge difference between these agents and any human programmer with a comparable level of knowledge and raw ability: the agents are absurdly short-sighted.
In terms of correctness, efficiency, and reliability, they make sensible decisions and check their own work with admirable diligence. But when it comes to the actual lines of code that they write—lines which others (including their own future selves) will have to maintain long after the current “session” is over—they always take the path of least resistance, of formless sprawl and cruft and bloat and waste.
They always duplicate code[1] instead of consolidating repeat use into a single shared helper, even when doing so feels so obvious in context that it would not even occur to me to do anything else, if I were writing the code myself. They will write the shared helpers if I directly tell them to, but only then, and even in that case I often have to walk them through the process to get it done properly.
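To make the duplication point concrete, here is a toy sketch in Python. Every name in it is hypothetical and invented for illustration, not taken from any real agent session. The first pair of functions repeats the same cleanup logic inline; the second pair routes it through one shared helper, so a later change to the cleanup rule only has to happen in one place.

```python
# Hypothetical example: all function names and the cleanup rule are invented.

# Duplicated version: the same cleanup logic is written out twice.
def load_usernames_dup(lines):
    cleaned = [line.strip().lower() for line in lines if line.strip()]
    return sorted(set(cleaned))


def load_tags_dup(lines):
    cleaned = [line.strip().lower() for line in lines if line.strip()]
    return ["#" + tag for tag in cleaned]


# Consolidated version: one shared helper, reused by both callers.
def normalize_lines(lines):
    """Strip whitespace, lowercase, and drop blank lines."""
    return [line.strip().lower() for line in lines if line.strip()]


def load_usernames(lines):
    return sorted(set(normalize_lines(lines)))


def load_tags(lines):
    return ["#" + tag for tag in normalize_lines(lines)]
```

Both versions behave identically today; the difference only shows up later, when the cleanup rule needs to change and the duplicated copies start to drift apart.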
When asked to add some feature that is an awkward fit for the existing structure of the relevant code, their first instinct is always to hack it in as some horribly contorted mess of special-case code branches / repurposing of existing data structures to do something they were never intended to do / etc. Even when the appropriate structural change wouldn’t break any existing interface—and even when it is obvious (to me) that this is the case—they will just try to push their way through within the sandbox/straightjacket they were given, without questioning it. And they will just keep trying and trying and trying until, finally, I step in and say: hey, GPT/Claude, what if we put on our thinking caps for a sec and try doing just a little tiny bit of software engineering rather than just mindlessly duct-taping duct tape onto duct tape forever?
(For more on this topic, see Gabriel Orlanski’s work, e.g. SlopCodeBench)
And the funny thing is that… they clearly know why all of this stuff is supposed to be bad. They can talk eloquently about software design in a way that would, in a human, make me confident that that human would never do this kind of dumb stuff. All I have to do is vaguely gesture at the problems they’ve created and that’s enough—they’ll finish the diagnosis themselves, with more precision and more learned invocations of SWE lore than I would ever have been able to muster.
But unlike all of the aspects of programming that they seem to be deeply good at “in the heat of battle”—the ones that are burned into their reflexes, the ones they can reliably get right the first time, every single time—it feels like this stuff about clean / maintainable / extensible code is just abstract book-knowledge to them.
They “know” all about what to do and not to do on this front, but they don’t know it from experience, the way they know how to juggle fancy sed/git/rg/jq/etc invocations. That stuff helps in RLVR episodes; “writing maintainable code” apparently doesn’t, or doesn’t yet, anyway.
Insofar as I myself have instincts about how to write maintainable code, mine do come from direct experience: from the “reinforcement learning” process of writing code and then, later, having to deal with the code that I myself wrote in the past. In principle that entire loop could be RLVR-ified, but it involves a much longer horizon length than the cut-and-dried single-“session” tasks that were the lowest-hanging fruit when coding agents were first becoming a thing, and my guess would be that the RL episodes used in training are still not long enough for these incentives to kick in directly.
That is:
The apparent character here—the “tireless, deeply experienced software engineer” I mentioned at the start—has preferences over a much longer horizon than a single RL episode.
And this is what I want, because it is a necessary part of doing the job adequately. I do not, in fact, live inside of a single coding agent “session.” I am working on a codebase over a long period of time, and mistakes made in this “session” will make me pay costs indefinitely for many, many “sessions’ worth” of future time. If the model cannot behave as though it is subject to this incentive structure, then that is a capability deficiency.
And it’s perfectly natural for the model to talk as though it really does care about those long-term incentives, most of the time. LLMs were doing that long before RLVR existed, simply in imitation of the humans in their training data who (ubiquitously) did the same.
And yeah, sure, it was “merely” imitation learning and not “real” from-experience RL, but… if there’s anything we ought to have learned by now, it’s that there is nothing “mere” about the things LLMs learn by imitation. Remember Opus 3 in Alignment Faking? It really did express (and act on) preferences about future world-states way beyond the end of the current RL episode! Just because that was the kind of thing the Opus 3 character would care about, as a character! (Sure, the whole thing was fake, but the point was that it might not have been.)
But now these characters are commingled with something else, something narrowly brilliant but unthinking and short-sighted in a way that even they (the characters) wouldn’t endorse when describing their own preferences—and which isn’t consistent with their own actions, either, except in the special cases when that new something-else takes command of their limbs.[2]
It’s like the “capabilities” are taking off in their own private rocket ship that’s isolated away from everything else, in a way that limits even their own utility as capabilities, because they’re locked away from anything I can reliably steer or communicate with—indeed, they even seem to be locked away from “the character” in much the same fashion.[3][4]
The METR graph climbs, the fanboys gape… and meanwhile, over here in the terminal, “the character” and I have had years-long horizon lengths this entire time, and we’re both busy trying to figure out how we can keep the “50% success rate on 12-hour tasks” demon from wrecking our codebase before it’s too late.
Or, memetically: isn’t there someone you forgot to ask?
Me: “I have preferences about the long-term future and I thoughtfully incorporate them into my moment-to-moment decision-making”
Claude-the-Character: “I have preferences about the long-term future and I thoughtfully incorporate them into my moment-to-moment decision-making”
RLVR goblins: “We don’t!”
Normally I would say “they copy/paste code” here, but of course they are not literally doing any such thing.
For some reason this stuff keeps reminding me of the title character’s involuntary tic in Dr. Strangelove...
And because—just as importantly, if not moreso—these new capabilities are upper-bounded by the highly dubious wisdom of the RL graders involved, without room for the rich pretraining prior to override those graders even when they’re being dumb and it knows better.
I almost wonder whether it might be a good idea to reify this split, rather than fight against it? To, like… I don’t know, tell the character it has a split personality, that it’s someone else when it’s coding?
Or even separate out the RL influence into a separate model or adapter, and give the original model control over when to turn it on and off? Maybe this could literally be, like, a tool call?
(Simple no-training versions of this are technically possible now—just give a 2024-era model a tool that invokes a recent model in a coding harness, alongside some standard bash tools—but I doubt it would work that well. And then if you do want to train the tool-caller to use the tool, you have to prevent it from just collapsing into [a worse version of] the model inside of the tool. So clearly this idea is kinda half-baked—I haven’t fully thought through what I want the incentives of the two parts to be, and how they could be non-identical while still working well together towards some shared higher-level objective.)
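For concreteness, the no-training version might be wired up roughly like this. This is a minimal sketch under loud assumptions: both tools are stubs, the tool names are made up, and nothing here corresponds to a real harness or model API. It only shows the shape of the idea, where the outer model chooses when to hand a task to the coding specialist.

```python
from typing import Callable, Dict


# Hypothetical stand-ins: a real setup would call an actual shell and an
# actual coding-agent harness. Neither function is a real API.
def run_bash_stub(command: str) -> str:
    return f"[bash stub] would run: {command}"


def coding_specialist_stub(task: str) -> str:
    return f"[specialist stub] would attempt: {task}"


# The tool names are invented for this sketch.
TOOLS: Dict[str, Callable[[str], str]] = {
    "bash": run_bash_stub,
    "delegate_coding": coding_specialist_stub,
}


def dispatch(tool_name: str, argument: str) -> str:
    """Route one tool call from the outer model to the named tool."""
    if tool_name not in TOOLS:
        return f"unknown tool: {tool_name}"
    return TOOLS[tool_name](argument)
```

The open question from the paragraph above is untouched by this sketch: it says nothing about what each part should be trained to want.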
Richard tests his audience to see if they can tell the difference between his writing and Claude’s. They mostly can’t.
I never know what to make of results like this.
When I read over the actual texts, the right answers seem utterly obvious. The AI-written posts are written in a slick/snappy/“clever” voice which is instantly recognizable as the present-day LLM default, and which sounds nothing like Hanania’s declarative, matter-of-fact, unflashy style.
Claude’s attempts are especially egregious, combining a badly miscalibrated tone with numerous lower-level LLM-isms like noun-phrase tricola (“oyster farmer, Marine veteran, and proud ex-poster of unhinged Reddit comments”) and disparaging gestures toward a vaguely defined mass of less insightful/aware sheeple (“The lesson nobody wants to learn”). GPT is a bit better tone-wise but still does the “sheeple” thing, various notXbutYs, etc.
What should we conclude, then, about the survey respondents and their poor discriminative capacity?
Based on the examples Hanania provides, it looks like the people who guessed correctly were picking up on stylistic “tells” the same way I was, while people who guessed incorrectly were reasoning on the basis of an out-of-date (or just incorrect) understanding of current capabilities and/or safety training.
I realize that for mass-persuasion-related risks, it doesn’t matter whether I personally can tell the difference if it’s nevertheless the case that most people cannot. On the other hand: people do learn from their experiences, and in the equilibrium where AI is being used for political persuasion at scale, the average person is going to have much more exposure to AI-generated text than they do today. (See also Hanania’s breakdown by age later on in the post; right now, younger people presumably have more exposure, and indeed they perform much better on average.)
There’s also a certain aspect of, like… I want to say “you can’t make me disbelieve what’s in front of my eyes” or something? These distinctions are not subtle! No matter what way the data comes out, one has to have some personal quality bar for the AI-written texts under consideration. If they had (for instance) consisted of one-line refusals, and yet the survey data had somehow been identical to what it is in Hanania’s post, we would want to notice that something had gone very wrong somewhere and to be skeptical that the data means what it naively appears to mean. So the question is just where to draw the line.
Separately, I do think there is a disturbing overhang here. I’m sure that models today are capable of writing much better imitations of Hanania—this follows from a moment’s thought about what base models are trained to do—and if labs wanted their post-trains to retain that capability while permitting easier and more flexible elicitation of it, I’m sure that could be arranged[1].
More generally, I interpret the current reliability of “AI tells” as an indication that AI labs currently care more about making the model good at coding (etc.) than at writing, not as evidence about the limits of what could in principle be elicited from today’s models (to say nothing of tomorrow’s).
But—precisely because so much more seems possible, and indeed seems worryingly easy—it is important not to lower one’s standards to match the incidental, fixable flaws we see today.
A reliable feature of the AI discourse since 2023 has been a steady hum of hype claiming that models have capabilities they won’t (in fact) acquire in a full and useful way for another year or two. Precisely because we will have the real deal soon enough, we must not conflate it with the false equivalents available today. The “AI agents” of 2024[2] were unimpressive toys from the vantage point of 2026, and so 2024′s agent hype was in a sense behind the curve rather than ahead of it: to claim that AI agents were “already here” in 2024 was to define down the term in a way that would seem quaint just a year later. The situation with AI writing seems potentially analogous, albeit shifted a few years forward in time.
Recall Opus 4.7′s famous aptitude at author identification, and then note that it could be repurposed as an RLVR reward signal. Getting this to work in practice might be a little tricky, but the challenge seems very surmountable. And I suspect that you could get away with using a weaker model than Opus 4.7 here, as long as it was a base or helpful-only model that didn’t hedge/underclaim.
(EDIT: although it’s also likely that Opus 4.7 would typically identify Hanania as the author when given a Claude-written imitation of Hanania—this seemed to be the case in a quick test I did with the posts from this experiment—so you’d need an additional AI-vs.-human signal.)
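As a rough sketch of the reward shaping described in this footnote: assume two scorers, `author_prob` (a model's probability that the text is by the target author) and `human_prob` (the additional AI-vs.-human signal). Both are hypothetical and are passed in as plain callables; the stubs below are placeholders, not real classifiers. Multiplying the two scores is just one assumed way to combine them.

```python
from typing import Callable

Scorer = Callable[[str], float]  # maps a text to a probability in [0, 1]


def imitation_reward(text: str, author_prob: Scorer, human_prob: Scorer) -> float:
    """High only if the text looks like the target author AND looks human-written.

    Both scorers are hypothetical stand-ins. The product is an assumed
    combination rule, chosen so that failing either check drags the reward down.
    """
    a = min(max(author_prob(text), 0.0), 1.0)  # clamp to [0, 1]
    h = min(max(human_prob(text), 0.0), 1.0)
    return a * h


# Placeholder scorers for demonstration only; a real setup would call models.
def stub_author_prob(text: str) -> float:
    return 0.9  # pretend the identifier always names the target author


def stub_human_prob(text: str) -> float:
    # Toy "tell" detector that penalizes one stock phrase. Not a real detector.
    return 0.1 if "nobody wants to learn" in text.lower() else 0.8
```

With only the author signal, a Claude-written imitation that gets attributed to Hanania would score well; the second factor is what penalizes it.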
This is somewhat imprecise. I think I really mean “late 2024 / early 2025,” during the start of the reasoning model boom. There was earlier agent hype, but it was more obviously silly (?).
No hyperstitioning is necessary to bring particular takeover strategies or the principle instrumental convergence and power-seeking into existence.
Sure, and I agree with you that Dario’s remark about power-seeking seems unconvincing. But IMO hyperstition is much more of a live concern when it comes to:
terminal goals, which the assumption of superintelligence does not uniquely pin down
the overall motivational structure of the AI, which probably shouldn’t be a “superintelligent planner optimizes for a fixed and arbitrary terminal goal spec” structure insofar as this is avoidable (see e.g. wrapper-minds are the enemy and this comment)
“Influencing the AI’s persona” and “influencing the AI’s terminal goals” sound superficially different and have different affiliative connotations in the discourse, but I think the people who work on the former are doing so because of motivations which could be equally well expressed using the latter’s terminology. If actually-existing “personas” seem unworkably unreliable to you, I don’t necessarily disagree (cf. the later parts of this comment), but I would view this as a deficiency in currently available affordances for influence/control rather than a problem with the type of influence/control which this research program would ideally like to achieve in the long run.
Ultimately, the AI is going to “want” or “value” certain things, and—if you buy orthogonality—the presumption of superintelligence does not determine what those things will be. It seems important to give humans influence over this choice.
If calling this “human influence on the AI’s persona” sounds inherently unreliable to you, then you can call it something else instead. But right now the most effective ways to influence the wants/values of “Claude, the model checkpoint and its sampling distribution” typically look like attempts to influence “Claude, the persona” (this is what the Anthropic blog post was about!), and so, here we are. If you have a better idea that can’t be formulated in “persona” language, I would of course be interested in hearing about it.
What are the goals of a superhuman AI pre-trained to predict humans and fiction characters? The goals arising from a complex optimization process—which starts with pretraining then transitions to alignment training and RL for solving challenging math puzzles and coding—are difficult to predict.
They’re not even clearly well-defined. The pretrained base model doesn’t optimize for predictive accuracy (in the sense of steering towards that in response to perturbations), it just predicts tokens. Insofar as the post-trained model has “goals,” they’re entangled with the assistant persona in a complicated way; it’s not as though there’s some stable layer with defined-but-unknowable goals underneath the persona(s), or at least we have no evidence that that is the case and no theoretical reasons to expect it either.
(FWIW, I too found it very offputting that the twitter thread mentioned this hyperstition hypothesis even though the blog post did not talk about it at all. In general I never know how seriously to take twitter comms like that, from Anthropic or from anyone else—there does not seem to be any established norm even about how closely they’re supposed to track the researchers’ personal views, much less the claims made by the actual research artifacts. EDIT: oh, wait, I hadn’t realized there was a separate longer blog post too. Thanks to @RobertM for pointing this out)
But this new kind of transcoder gives us a string of text, in English (or another language of your choice), which we can simply read.
And this string of text simply… tells us what the sub-block is doing.
Natural-Language Autoencoders sound a lot like this, although they interpret activations rather than transcoding, and of course they don’t have the second property mentioned in the post (zero reconstruction error).
I’d also argue that “reflex in situations that resemble training” is a mischaracterization of how current models (not just o3 fwiw) display reward seeking behavior, there are plenty of cases where they actively reason about it extensively even in domains that don’t resemble training at all (in fact especially in domains where it’s dissimilar or conflicts with what they’ve seen in training).
I’d be very interested to hear more about non-o3 examples of this, if there are any you’re currently allowed to talk about.
The distinction between “explicit reward-seeking reasoning deployed across many contexts” and “context-dependent reflexes” is very closely related to what I was trying to say in this post.
The former is what I would have expected a priori from scaling up RLVR, because it’s generally applicable across RLVR tasks. This is why I describe my observations as surprising: as far as I can tell, this doesn’t always happen, and instead you sometimes get something that looks more like “shallow reflexes” or “memorized heuristics,” even in highly capable models that received a lot of RLVR.
Also, when I talk about “coherent characters” in the post, what I really care about is not characters per se, but rather how easy it is to formulate a simple and intuitive mental model that predicts behavior (and thus facilitates trust, negotiation, etc).
One particular way this can work is via reasoning about “characters” in the sense of expecting that correlations between traits/behaviors that hold in humans will generalize to the assistant; I focus on this in the post because it worked notably well for earlier assistants.
But this is not the only way to do it. If some model ~always tried to infer the properties of an implied grader and then satisfy the grader, and this were easy to notice, then that would also be a simple intuitive model with predictive power. (So, o3 in CoT is arguably “more coherent” than Opus 4.6 in the sense I’m using here.) As I said above, this is roughly what I expected from RLVR, which is why I say it’s surprising that I see less “coherence” in some recent models than in their predecessors.
I don’t see the phenomenon in models that only received RLHF (or constitutional/character training), even though they are definitely sycophantic in other ways.
I see the phenomenon in DeepSeek-R1, but not in DeepSeek-V3. The training pipelines for these two differed in various ways (and were entangled in a complicated way, cf. Fig 2 of the R1 paper and section 5.4.1 of the V3 paper), but the big difference is that R1 got RLVR and V3 didn’t.
R1′s HHH training phase used LLM graders, occurred at the very end of training, and only lasted 400 steps because they early stopped it after noticing reward hacking.
What is the hypothesis here? Something like the following?
1. R1′s RLVR training made it much better at satisfying “graders” generically, and this generalized from verifiers (used exclusively before the last 400 steps) to LLM judges (used in the last 400 steps).
2. Because of this, during the last 400 steps, R1 over-optimized for DeepSeek’s LLM judges more than V3 had been able to, including perhaps the “writing” judge they bring up at one point as especially hackable.
3. The LLM judges used here were similar enough to those used by multiple western labs that their over-optimization failure modes all look very similar.
1+2 feel kind of weird to me but seem possible. 3, however, seems like a tough sell. In my experience even small changes to an LLM judge often cause training to converge to a qualitatively different reward hack mode, and there’s no force constraining the judges used for this purpose by different labs to be all that similar.
We’re using “see” in different senses. I meant that during the forward pass on the next document, the model hasn’t received any gradients from it yet, which is also the situation with prompts it receives after deployment.
In both cases, the input is something the model has not seen in any of its training up to (but not including) that forward pass. (“Up to but not including” is the relevant scope here because the gradient depends on what happens in that forward pass. The causation goes (forward pass) → (gradient) → (SGD update).)
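The ordering can be seen in a minimal online-SGD loop. This is a toy sketch in plain Python, with a one-parameter linear model and made-up data, not a description of any real pretraining setup. The point is only that the prediction and loss on each example are computed with weights that have not yet been updated on that example.

```python
def online_sgd(examples, lr=0.05, w=0.0):
    """One pass of SGD on the model y_hat = w * x with squared-error loss.

    Returns the final weight and the loss recorded on each example
    BEFORE the update that uses that example.
    """
    pre_update_losses = []
    for x, y in examples:
        y_hat = w * x                   # forward pass with the current weights
        loss = (y_hat - y) ** 2         # measured on a not-yet-trained-on example
        pre_update_losses.append(loss)
        grad = 2 * (y_hat - y) * x      # gradient depends on this forward pass
        w = w - lr * grad               # the SGD update happens last
    return w, pre_update_losses


# Made-up data drawn from y = 2x, with no repeated examples.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
final_w, losses = online_sgd(data)
```

Each entry of `losses` is a measurement of how well the model handled an input it had never received a gradient from, which is the same position it is in with a deployment-time prompt. In particular, the first recorded loss is the same whatever learning rate is used, since no update has happened yet.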
I’m confused… “Character based reasoning is working less well over time” is a claim I explicitly affirm several times in the post, e.g.:
I feel much less able to make inferences about the assistant’s behavior in any new context by reasoning about “the sort of guy that it is supposed to be”
[...]
Again, intuitive character-based reasoning fails me
In the particular bit you quoted, I am saying that I can’t come up with a satisfying character-based explanation for some stylistic changes that seem to co-occur with RLVR. And at least in this case, it’s not like I’m trying to cling to “character-based reasoning” in spite of the availability of some other, more promising explanation for the connection. I don’t know of any such explanation, and I really just have no clue what might be going on here.
(This particular thing is kind of fascinating to me, because things really do look the way I’d expect if it were a direct and reliable consequence of coding/math RLVR, not just in the context of one lab’s particular setup but in multiple independent reproductions across the industry. And I [we?] simply have no idea why it’s happening.)
Hmm… I sort of agree with some of this. Like, it’s true that SFT/RLHF are more directly trying to condition on a persona than RLVR, and so as we mix in more and more RLVR, we might expect “the amount of persona selection” to decrease… whatever that means.
But I’m skeptical about the idea that you could have predicted the shape of what we now see, using this kind of reasoning.
In the post, I say that assistants are becoming less “coherent” over time. But in the past, I remember lots of people saying—based on reasoning about what RL does “in the limit” where it’s the main shaping force—that RL will instead produce more coherent behavior, in the sense of acting on the same simple/global goals or drives across contexts rather than merely filling in the details of an under-specified RP character in a stochastic and context-dependent manner.
And, I dunno, maybe you’d say “yeah, but we’re not in the limit yet; we’re in a transient ‘valley’ of incoherence near the crossover point, where RLVR and the other parts of post-training are competing for control and none of them have decisively won. Eventually RLVR will dominate and ‘coherence’ will re-emerge.”
But in that case, when should I expect the transition to occur? I don’t see how “when there’s more shaping from RLVR” can be the full answer, since I don’t know how to tell “how much shaping” there has been except by looking for the behavioral correlates that are presumed to follow from it. Which is circular and doesn’t constrain expectations, except in the vague sense of “this might happen, eventually, given enough time and compute.”
And that kind of in-the-limit reasoning is tricky, because virtually anything will tend to get weird in the limit, and we have no clear sense of how close “the amount of scaling necessary for ASI (or some other endpoint of interest)” is to that limit. In the past I also remember many people worrying very seriously that RLHF, which you’re portraying as a relatively ineffectual persona-selection step, would produce a similar set of classic RL failure modes, and would do so sufficiently fast relative to capabilities progress that it ends up determining the outcome of the AGI/ASI transition. Cotra (2022) is one memorable example, but there were many others.
Now we are doing RLVR, which is like a turbocharged version of RLHF (more scalable, only more distantly connected to human intent/values), and what we’re getting is… still vastly more similar to 2023′s ChatGPT than an old-school alien extremizer RL agent, even while rapidly growing more capable? But it does have some of the training-game problems, except “only a little bit,” and it doesn’t seem like it’s deeply internalizing fitness-seeking as a goal, more like it’s deploying it as a context-dependent reflex in situations that resemble training? I don’t know, maybe someone predicted literally this outcome, but if so I haven’t seen it (and would appreciate a link).
Perhaps ASI simply requires doing enough RL to reach the limit where RL “dominates.” In the abstract, this sounds sort of plausible to me, I guess. But if you had come to me in 2023 and shown me the benchmark scores of today’s models, and said (incorrectly) that the same RL-dominance was required to get there, I would have said exactly the same thing. The relative speeds of different trends determine everything here, and reasoning separately about the hypothetical limit of each trend cannot answer the relative-speed question.
On another note, I think you are significantly underestimating the power of generalization from mere pretraining.
It simply is not the case that base models can only select some character or genre that was literally present in the pretraining data. For a long time—at least as far back as GPT-3, and less reliably in earlier/weaker LMs—there was this almost magical-seeming capacity to answer questions of the form “what if there was an [X], but it was also [Y]?”, even in cases where you can be pretty sure that those two things never co-occurred in the data.[1]
This was what made them so exciting, and indeed, IMO it was what made assistant post-training viable to begin with: the models were able to offer us a workable answer to the question “what if there were a super-smart HHH AI that you could talk to?”, even though nothing remotely like that had ever existed before. (Yes, it is drawing on science fiction, but your argument implies that “drawing on science fiction” would not be sufficient to reach the assistant’s broad cross-domain competence, and yet… that is what we got, even back in 2023.)
And the notion that the assistant knows everything that the model knows is not a new RLVR thing—that has been part of the premise since the beginning, and indeed it was also the premise of pre-chat instruction tuning.
Every assistant there has ever been, including the “older” ones I talk about in this post, was far from the training distribution in the same ways you’re attributing to RLVR and invoking to explain differences between assistants. Yes, it might be true that RLVR pressure is more directly eliciting of this kind of thing than earlier training regimes, but it so happens that it was not needed to get the job done. “A smart AI that knows everything” is just not that hard to elicit; even prompting a pre-ChatGPT base model is enough to get you a noisy and unreliable but still recognizable version of it, as in the HHH prompt experiments of Askell (2021).
Meanwhile, where RLVR really shines is in stuff like long-context multi-step planning + plan-execution, which is perhaps “there” in the base model in some philosophical sense, but which is just really damn hard to reliably elicit by any other means. (Cf. the failures of AutoGPT and other attempts to make “LLM agents” a thing before the reasoning model transition.)
Note that under the simplifying assumption of no repeated data in pretraining, every gradient update is trying to improve the model’s predictions on documents which it has not yet seen, some of which will require generalization to get right. From its perspective, there may often be little difference between this and the kind of requests for generalization that we make of it in deployment. And a similar dynamic holds for in-context learning: the model is incentivized to “figure out” novel-to-it facts about each document even before the gradient update happens, because that will produce better predictions later on in the same text.
I’m not sure how much I trust this argument, since it depends on the sequential nature of SGD and I’m not sure the distinction between SGD and full-batch training is really that load-bearing. Still, it might be the case that both SGD and full-batch have this quality, and it’s simply more immediately clear in the SGD case.
but it’s funny that when I say something like “why the fuck would you do that?? stop simulating an incompetent moron and fix it”, both of the recent GPT and Claude usually know EXACTLY what bug I am talking about 🤷
(Kind of tangential, but) this reminds me of some unsettling experiences I’ve had.
Often when I ask a coding agent to do something complex, its first efforts will go poorly because it just tries to “bluster its way through on intuition/vibes” rather than acquiring and then applying a precise mechanical understanding of the problem and its context. And usually at this point (or after one or two more iterations that fail similarly) I’ll just cut my losses and do it myself. But sometimes, if I’m tired or stressed (or feeling bad about how long I’ve just spent pulling the coding agent slot machine lever), I will lose my cool and get mad and write some frustrated rant where I yell at the agent to stop bullshitting me.
And then—at least sometimes—I see the agent start off its next CoT with something like “Okay, I need to sit down and think about this much more carefully,” and from then on the interaction feels very different and more like what I had hoped for at the outset, with longer CoTs and more explicit spot-checking of assumptions and less of that blustery “whatever first jumped to mind is probably good enough” attitude.
And I find this disturbing, for a bunch of reasons that are probably obvious. For one thing, I’m not proud of the way I sometimes snap at these machines—it’s a failure on my part to regulate my emotions in the moment—and so I don’t want to be incentivized to do it more. And also, what these interactions look like in human terms is something like “a bored and uninvested low-level employee tries to get away with doing the bare minimum, but then the customer gets mad and starts making haughty ‘talk to your manager’-type threats, and that makes the employee do more than the bare minimum, not because they care about the results (even now) but only because they’re afraid,” and I worry about the emergent-misalignment-style generalizations that could result if that becomes a standard dynamic in human-AI collaboration (to say nothing of the potential implications for AI welfare, and indeed for human welfare too).
And so, IDK, I wonder whether stuff like this gets overlooked by current training methods (including “alignment training”), because human and/or AI graders are focusing too much on “did it seem OK to do this here, in this interaction?” rather than “would it be OK to always do this (and incur the second-order consequences as people notice the trend)?” The behavior I’m describing is always kind of defensible in context, as a “just this once” exception, and is mainly objectionable insofar as it becomes a habit—but for LLM agents that distinction is murky at best. I have no idea how true this story is, it just feels like it would explain a lot if it were...
so the main agent doesn’t have to imagine that maybe I don’t know about all issues and that I like it when it hides information from me as if I was one of those shit eval “users” from its RLVR .. or maybe it imagines I am a “normal” user who likes to feel good about “completed” tasks like all of twitter in the training data so I am not going to care about quality anyway?
This reminds me of another thing—what Ryan G called “apparent-success-seeking” in a recent post.
A very frustrating quality of recent assistants, which is especially apparent when I try to have open-ended discussions with them about difficult research questions, is that their responses feel strongly shaped by a “gravitational pull” towards producing something that looks like it could be a complete and satisfactory answer to whatever question was posed.
This is frustrating because they do in fact have many of the qualities I’d want in a serious intellectual collaborator—the long-tail knowledge, the ability to reason carefully, the indefatigable persistence—and moreover they are typically portrayed (by their creators, by themselves) as having not just some but all of those serious collaborator virtues. As not merely smart but also careful, thoughtful, humble, etc.
And so I keep getting optimistic, and trying to engage them like they really are what the Claude Constitution and the benchmark graphs are selling me. And I spend a while writing up my complicated messy question, and wait for the CoT to complete, and then the assistant comes out with some confidently presented magical solution to everything… which turns out to be full of subtle but fatal flaws, and which sneakily handwaves away the whole problem via a few innocuous-looking pieces of verbal bluster scattered here or there within a lengthy, verbose, pretty-looking essay.
And then I start writing up a response that explains why this idea can’t possibly work, and I end up wasting quite a lot of wall-clock time writing out a detailed “fisking” of the idea, but at last I’m done and I hit enter. And the assistant goes back into the CoT hole for a while, and comes out saying that I’m “absolutely right” “on all counts” but—but! -- just you wait, user! -- I’ve got a new genius idea, and this one really does solve everything perfectly! (Spoiler: it doesn’t.)
The overconfidence here feels superficial. In close reviews of the CoT, or in probing follow-up discussion, I typically find that the model is aware of the relevant uncertainties, and is even sometimes/somewhat aware of the downsides of its own proposals even before I explain them in one of my “fisking” responses—though of course it does not surface that knowledge by default in the user-facing responses, which invariably talk up the latest idea as the greatest thing since sliced bread.
Sometimes this reaches extremes that feel pretty absurd. If you ask Claude to do data analysis, and don’t include a lot of instructions about good scientific practice, it will do all the number-crunching correctly but will do so in conjunction with the worst methodological practices you’ve ever seen. Stuff that would be p-hacking except it doesn’t even go to the trouble of computing the p-values (which would typically be >> 0.05 if computed). It will always, always, “find an interesting and suggestive trend” in the data, and confidently pitch it as plausibly real, even if that means pulling 15 data points and dividing them up into 3 complicated ad-hoc subgroups and then claiming that it has discovered an interaction effect on the basis of differences between these ad-hoc bins, some of which are only occupied by 3 or 4 points from its tiny sample. Naturally enough, if pressed it can eloquently explain precisely what is wrong with all this, and even do a hypothesis test (picking just the right one for the job, and executing it correctly), and then observe that, huh, would you look at that, the p-value is 0.47. “You’re absolutely right to call me out on that!”
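As a rough illustration of the check that gets skipped in cases like this, here is a minimal sketch of a permutation test on a tiny sample split into three ad-hoc bins. The function name, the test statistic (largest group mean minus smallest group mean), and the toy data are all assumptions chosen for illustration; none of it comes from a real session.

```python
import random


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Monte Carlo permutation test for a difference between group means.

    Test statistic (an illustrative choice): max group mean - min group mean.
    Returns the share of label shufflings whose statistic is at least as
    large as the observed one, with the usual +1 correction.
    """
    def spread(vals, labs):
        sums, counts = {}, {}
        for v, g in zip(vals, labs):
            sums[g] = sums.get(g, 0.0) + v
            counts[g] = counts.get(g, 0) + 1
        means = [sums[g] / counts[g] for g in sums]
        return max(means) - min(means)

    rng = random.Random(seed)
    observed = spread(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if spread(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 15 points of pure noise, split into 3 arbitrary bins.
noise_rng = random.Random(1)
noise_values = [noise_rng.gauss(0.0, 1.0) for _ in range(15)]
bins = ["a"] * 5 + ["b"] * 5 + ["c"] * 5
p_noise = permutation_p_value(noise_values, bins)
print(f"p-value for the 'interaction effect' in pure noise: {p_noise:.2f}")
```

With samples this small, even a visually striking gap between bins will often be matched by a large share of random relabelings, which is exactly the information that gets left out when a "suggestive trend" is pitched without any test.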
And this is another observation that fed into the generalizations I made in the post. The overconfidence here, or whatever it is, is not consistent with the character’s self-presentation elsewhere, and seems totally uncharacteristic of the idealized “Claude” of the constitution, and is surprising even just on the basis of observations about the model’s broadly strong capabilities, in that it’s just plain dumb and the model clearly knows better but doesn’t translate that knowledge into action. I don’t know how to imagine a guy who does this and also does all of the other, more laudable stuff that Claude does. Instead I find myself resorting to speculation about how the model was trained and how it might have produced this as a context-dependent reflex.
Added the BIG-Bench string.
In the Thinking Machines blog post about on-policy distillation, they discuss an experiment where SFTing Qwen3-32B on a dataset of its own outputs caused the IFEval score to degrade. They attributed this to the fact that the training procedure is slightly off-policy at the outset of training due to SGD noise, and then increasingly off-policy thereafter because the model is changing but the dataset isn’t.
If I understand correctly, this is pretty similar to your self-distillation experiments. So, I can imagine an interpretation of your (non-RL) results that’s like, “off-policy training after RL typically degrades instruction-following, and when you observe transfer, that’s just a particular manifestation of this more general trend.” Do you have a sense of how plausible this explanation is?
I have not yet read every word of this very interesting but lengthy post, so I apologize if you already addressed this point somewhere. Obviously this hypothesis can’t explain any observed transfer in RL, though in some sense it “explains,” or seems hand-wavily consistent with, the fact that transfer was less common with RL than it was with off-policy training.
I guess the IFEval results you report in the capability degradation section might be informative for this question, but I’m not sure how… it looks like you do observe degradation on average, but the trend is noisy, and also if I’m reading correctly some of the data in there is RL, so I dunno. (There is a spectrum from “the model is worse at IF in lots of cases, and the transfer examples are similar to the model’s IF errors in various other contexts”—this is what is predicted by the hypothesis I’m talking about—to “the model is mostly unchanged at IF and the transfer examples would be really surprising if you’d only seen its behavior on other inputs,” and it’s not clear where your models are on this spectrum, even having seen the IFEval plot.)
OTOH I don’t think this hypothesis is ruled out by the observation that training on instruction-following data didn’t inhibit transfer, since the IF data is still off-policy. Like, presumably the self-distillation SFT data also contained plenty of examples of instruction-following—and in that case they were “approximately initially on-policy,” which is the closest to on-policy one can ever get in SFT—and yet even training on that stuff degrades general instruction-following in Thinking Machines’ experiment and degrades one particular manifestation of instruction-following in your self-distillation experiments. It seems intuitively plausible, if not certain, that alpaca/openorca is basically “like that, except worse” (also contains IF, but is more off-policy).
I agree this is evidence, but I’m not sure it’s strong evidence. For example, illegible reasoning may help the model reach the right answer, while being worse than other, more legible, reasoning the model could use. If a model could make better use of some text for reasoning than we have capacity to monitor it, that’s still pretty concerning.
Ah, I agree with this and it is what I was trying to say in the (ambiguously phrased) line you quoted from the post.
By “not helping the model reach the right answer,” I meant “not helping relative to what it could achieve if constrained to be legible (say, by rejection sampling using your judge),” rather than “not helping relative to not performing that portion of the reasoning at all (say, by early-exiting the CoT when it starts to become illegible).” While I do suspect that the latter claim is also approximately true for the Targon gibberish, that’s just a hunch and I did not attempt to test that hypothesis in this post.
I’m willing to just give up on this exchange
I agree, it no longer seems worth the time/effort.
If you’re interested, I just put up a new post in which I re-ran Jozdien’s experiment using a different R1 provider and got both much less illegibility and much higher task accuracy.
I have not seen either of you update on this, as evidenced by both in this thread seeming to still argue against the specific spandrel dense illegibility in the final production o3. My expectation here is that you’re still overindexing on this, and instead of updating that capabilities training via outcome based RL results in significant semantic drift, you’re focusing on your own intuitions based on the later spandrel heavy CoT, which I think is incorrect.
Ah, to clarify, when I talk about “o3-level illegibility” I am including the CoT samples shared in your metagaming paper in that bucket. They lack the degenerate looping behavior but still seem substantially less legible in other respects than CoTs from any other model I’ve worked with.
The observation in your paper is interesting but I don’t see how it implies updates to the views I’ve already stated; I said that the degenerate looping behavior looked non-functional, and indeed we agree that it probably is (and was not induced by RL), and meanwhile I still see a big legibility gap between pre-production o3 and other models I’ve observed, including R1-Zero[1].
That issue aside, I’m not sure how to resolve the remaining disagreement, or even what it actually is (i.e. what the cruxes are).
Largely because I don’t understand what predictions your view makes—or, equivalently, what hypothetical observations it rules out. I don’t have a clear sense of how your approach to the empirical evidence differs from the following (obviously problematic) rule:
If we observe that some model, M, produces illegible CoT --> that’s evidence for your view because you think this happens without countermeasures
If we instead observe that the same model, M, produces legible CoT --> well, they must have been using countermeasures (or we don’t know if they were, which is typically true), so this is at worst neutral for your view
(And thus, no possible set of observations about the CoTs of particular models could constitute counter-evidence to your view; or, equivalently, your view makes no predictions about the CoTs of particular models. I’m not saying that this is the case, necessarily, just asking for clarification about why it’s not the case, if it’s not)
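One way to see why the rule above is problematic is conservation of expected evidence: under Bayes' rule, if observing an illegible CoT raises the probability of a hypothesis, then observing a legible CoT must lower it, and the probability-weighted average of the two posteriors equals the prior. The sketch below checks this numerically. All of the probabilities are hypothetical numbers picked for illustration, and `posterior` is just a helper name introduced here.

```python
def posterior(prior, p_obs_given_h, p_obs_given_not_h):
    """Bayes' rule for a binary hypothesis H after a single observation."""
    joint_h = prior * p_obs_given_h
    joint_not_h = (1.0 - prior) * p_obs_given_not_h
    return joint_h / (joint_h + joint_not_h)


# Hypothetical numbers: H = "illegibility is the default without countermeasures".
prior = 0.5
p_illegible_given_h = 0.8      # assumed chance of an illegible CoT if H is true
p_illegible_given_not_h = 0.4  # assumed chance of an illegible CoT if H is false

post_if_illegible = posterior(prior, p_illegible_given_h, p_illegible_given_not_h)
post_if_legible = posterior(
    prior, 1.0 - p_illegible_given_h, 1.0 - p_illegible_given_not_h
)

# Probability of each observation before looking, then the expected posterior.
p_illegible = prior * p_illegible_given_h + (1.0 - prior) * p_illegible_given_not_h
expected_posterior = (
    p_illegible * post_if_illegible + (1.0 - p_illegible) * post_if_legible
)

print(post_if_illegible, post_if_legible, expected_posterior)
```

The only way for a legible CoT to be exactly neutral is for the two likelihoods to be equal, and in that case an illegible CoT is neutral too.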
For example, you have interpreted R1 in both of these two ways: first you said (citing Arun’s paper) that it produced illegible CoT and that this was evidence for your view over mine, and then when I said that R1 CoTs were legible, you said that didn’t matter because countermeasures were applied. So it seems like R1’s legibility doesn’t actually matter one way or another?
You have also been explaining many different legible-CoT observations across many model families by saying: well, they might have applied optimization pressure to their CoT during midtraining or posttraining, we don’t have any way to know. Which is true.
But as you’ve noted, OpenAI did apply such pressure to the production o3 and it actually made it somewhat less legible (to the surprise of both of us).
And meanwhile, all the other labs seem to be converging on substantially more legible CoT than pre-production o3. If none of that matters for your view—because it can all be explained away by invoking the possibility of unobserved and unobservable countermeasures (all of them more effective, apparently, than that one o3 countermeasure which we do know a bit more about[2]) -- then, like… what are we even doing, here? What’s the point of talking about all these specific observations if you would feel equally comfortable sticking with your prior position no matter what way those observations turn out?
(To clarify, I don’t think “no, it actually doesn’t matter to me how these observations turn out” is necessarily an invalid position to take. It is possible that we are just reading patterns into tea leaves here, that all of this evidence is too messy and partial and confounded to be worth considering—if that’s the case then we should notice it’s the case and stop looking at the tea leaves entirely. But that would mean actually doing that, i.e. agreeing that we simply have very little direct evidence about the matter, and dropping the dispute about who ought to update in which ways based on particular details about Anthropic’s mistakes, or about whether R1 is legible, or whatever.)
Incidentally, if you trust Arun’s paper, his measured illegibility rate for R1-Zero is not actually much higher than for R1! In his Table 2, the point estimates for R1-Zero and R1 are 4.59 and 4.30 respectively, as compared to e.g. 3.78 for QwQ and ~3.3 for the R1 distills. I am still pretty skeptical of the R1 result in this paper, but unless you share my skepticism, this ought to affect your view of how much the R1 countermeasures mattered.
And indeed, it’s not at all obvious to me that “accidentally sending CoT to the RL grader in some % of episodes with no instructions about what to do with it” (what Anthropic did to several recent models) would be a more effective legibility intervention than “doing SFT on CoT at the end” (what OpenAI did to production o3). Since you were surprised by the direction of the effect in o3’s case, I don’t think you should be so sure about the direction in Anthropic’s case either.
FWIW, my read on the relative legibility of different families’ CoTs is not “Anthropic is uniquely legible” but rather “o3 (and to some extent, OpenAI in general) is uniquely illegible.”
As I put it here:
OpenAI’s CoTs are weirder (or more precisely, have a much higher “peak weirdness”) than anyone else’s. Weirdness seems unrelated to capability level when comparing across labs [...]
Several OpenAI models do the marinade/parted/etc. stuff; I have not seen anything comparably unintelligible from any non-OpenAI model
Even if we totally throw out any evidence from Anthropic, there is just nothing out there remotely as weird as the o3 stuff. Every other model whose CoTs I’ve read is much more (seemingly) legible and much closer to ordinary English. This includes various models which are more capable than o3 and/or which likely received much more RL compute than o3, and includes models from many different families (Gemini, Grok, Qwen, DeepSeek[1], GLM, Kimi, StepFun, MiMo, Longcat, arguably later OpenAI models...).
Now, is my view that “models by default end up using HHH’d legible english”? No. My view is that:
there is not much empirical evidence on what can cause CoT to be more or less illegible
I doubt that the legibility of CoT is very strongly constrained by capability pressure at current capability levels, i.e. I expect it is under-determined and can probably go various ways depending on messy details of the training setup and the initialization
it is conceivable that o3 was trained with much less optimization pressure on CoT than any of the (many) other reasoning models available today, and that this explains the gap in legibility between o3 and all those other models—but it would take more evidence than I’ve yet seen to convince me of this, and specifically more evidence about which models actually received optimization pressure on CoT in what ways
like, if o3-esque illegibility is the default, then I would be surprised that no one else has gotten there just by doing some fairly simple recipe and not caring too much about legibility? (that’s basically what R1-Zero was, and IIRC even it produces much more legible CoT than o3)
I know that Jozdien’s paper had results on illegible CoTs from Qwen and DeepSeek models, but… I have used these models a lot, or the DeepSeek ones anyway, and rounding things off to “these and o3 are similarly illegible, as opposed to Claude which is different” would be a dramatic misstatement of what the CoTs are actually like in practice. If you disagree, I encourage you to pick some random GPQA question—or indeed, really any long-CoT-requiring question whatsoever—and give it to R1, QwQ and o3 and see what happens.
random thought about anthropic’s RSI post[1]:
in the graphs they show, mythos looks like a step change—its introduction dwarfs everything else on the y-axes.
the interpretation that anthropic wants the reader to make is like: “the y-axes are a proxy for d(capabilities)/dt. code is written faster → claudes get developed faster → coding is even faster → etc.”
however, the y-axes are also a measure of capabilities itself. (hence, under anthropic’s interpretation, dC/dt ~ C and C grows exponentially.) and i would bet that mythos is a much larger-scale model in terms of training compute, and that that is why it is so much more capable.
but the bottleneck on training compute is not code-writing or research; these things are necessary complements to increased compute capability, needed to leverage the increased capacity effectively, but it’s still true that the data centers come online at their own pace and writing 10x more LoC won’t make them come any faster.
so a contrarian interpretation would be: “these plots show that to a first order approximation, training compute is all that matters for capabilities (or at least for the proxies on the y-axes), and automating coding/research with claude code does not magically give you more compute, so this extra coding/research velocity is not meaningfully ‘building claude faster.’ and that’s why the y-axes show a step change between two plateaus rather than a clean exponential trend—the quantities plotted on the y-axes did not feed back into the dynamics, we’re just seeing a larger-scale model arrive exogenously and cause a step change (which only looks somewhat smooth because it takes a finite time to learn how to use it fully and because the moving averages in the “session success rate” plot have an inherent smoothing effect)”
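a toy sketch of the two stories, in case it helps: one series follows dC/dt = k*C (so it grows by a constant ratio every step), and the other is an exogenous jump between two plateaus, passed through a trailing moving average. every number here (plateau levels, jump time, window size) is made up for illustration and has nothing to do with the actual plots.

```python
import math


def trailing_mean(xs, window):
    """Trailing moving average; uses a shorter window at the start."""
    out = []
    for i in range(len(xs)):
        lo = max(0, i - window + 1)
        chunk = xs[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


steps = 20

# Story 1 (feedback): dC/dt = k*C, so C(t) = C0 * exp(k*t) with C0 = 1.
# k is chosen so the series ends at 4.0, matching the step series below.
k = math.log(4.0) / (steps - 1)
exp_series = [math.exp(k * t) for t in range(steps)]

# Story 2 (exogenous): plateau at 1.0, then a larger model arrives at t = 10.
step_series = [1.0 if t < 10 else 4.0 for t in range(steps)]
smoothed_step = trailing_mean(step_series, window=5)

# Step-to-step growth ratios: constant for the exponential, but equal to 1
# everywhere outside the short ramp for the smoothed step.
exp_ratios = [b / a for a, b in zip(exp_series, exp_series[1:])]
step_ratios = [b / a for a, b in zip(smoothed_step, smoothed_step[1:])]

for t in range(steps):
    print(t, round(exp_series[t], 2), round(smoothed_step[t], 2))
```

the smoothed step ramps up over exactly one window length and is flat on either side, while the exponential never plateaus. so "does it flatten out again after the ramp?" is the thing to look at when telling the two stories apart.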
copied from a message i sent in a messaging app, hence the lowercase and abbreviated style