I write original fiction.
Also I have opinions about AI and stuff, sometimes.
Elsewhere:
Same person as nostalgebraist2point0, but now I have my account back.
I have signed no contracts or agreements whose existence I cannot mention.
I write original fiction.
Also I have opinions about AI and stuff, sometimes.
Elsewhere:
Same person as nostalgebraist2point0, but now I have my account back.
I have signed no contracts or agreements whose existence I cannot mention.
I don’t see the phenomenon in models that only received RLHF (or constitutional/character training), even though they are definitely sycophantic in other ways.
I see the phenomenon in DeepSeek-R1, but not in DeepSeek-V3. The training pipelines for these two differed in various ways (and were entangled in a complicated way, cf. Fig 2 of the R1 paper and section 5.4.1 of the V3 paper), but the big difference is that R1 got RLVR and V3 didn’t.
R1′s HHH training phase used LLM graders, occurred at the very end of training, and only lasted 400 steps because they early stopped it after noticing reward hacking.
What is the hypothesis here? Something like the following?
R1′s RLVR training made it much better at satisfying “graders” generically, and this generalized from verifiers (used exclusively before the last 400 steps) to LLM judges (used in the last 400 steps).
Because of this, during the last 400 steps, R1 over-optimized for DeepSeek’s LLM judges more than V3 had been able to, including perhaps the “writing” judge they bring up at one point as especially hackable.
The LLM judges used here were similar enough to those used by multiple western labs that their over-optimization failure modes all look very similar.
1+2 feel kind of weird to me but seem possible. 3, however, seems like a tough sell. In my experience even small changes to an LLM judge often cause training to converge to a qualitatively different reward hack mode, and there’s no force constraining the judges used for this purpose by different labs to be all that similar.
We’re using “see” in different senses. I meant that during the forward pass on the next document, the model hasn’t received any gradients from it yet, which is also the situation with prompts it receives after deployment.
In both cases, the input is something the model has not seen in any of its training up to (but not including) that forward pass. (“Up to but not including” is the relevant scope here because the gradient depends on what happens in that forward pass. The causation goes (forward pass) → (gradient) → (SGD update).)
I’m confused… “Character based reasoning is working less well over time” is a claim I explicitly affirm several times in the post, e.g.:
I feel much less able to make inferences about the assistant’s behavior in any new context by reasoning about “the sort of guy that it is supposed to be”
[...]
Again, intuitive character-based reasoning fails me
In the particular bit you quoted, I am saying that I can’t come up with a satisfying character-based explanation for some stylistic changes that seem to co-occur with RLVR. And in at least in this case, it’s not like I’m trying to cling to “character-based reasoning” in spite the availability of some other, more promising explanation for the connection. I don’t know of any such explanation, and I really just have no clue what might be going on here.
(This particular thing is kind of fascinating to me, because things really do look the way I’d expect if it were a direct and reliable consequence of coding/math RLVR, not just in the context of one lab’s particular setup but in multiple independent reproductions across the industry. And I [we?] simply have no idea why it’s happening.)
Hmm… I sort of agree with some of this. Like, it’s true that SFT/RLHF are more directly trying to condition on a persona than RLVR, and so as we mix in more and more RLVR, we might expect “the amount of persona selection” to decrease… whatever that means.
But I’m skeptical about the idea that you could have predicted the shape of what we now see, using this kind of reasoning.
In the post, I say that assistants are becoming less “coherent” over time. But in the past, I remember lots of people saying—based on reasoning about what RL does “in the limit” where it’s the main shaping force—that RL will instead produce more coherent behavior, in the sense of acting on the same simple/global goals or drives across contexts rather than merely filling in the details of an under-specified RP character in a stochastic and context-dependent manner.
And, I dunno, maybe you’d say “yeah, but we’re not in the limit yet; we’re in a transient ‘valley’ of incoherence near the crossover point, where RLVR and the other parts of post-training are competing for control and none of them have decisively won. Eventually RLVR will dominate and ‘coherence’ will re-emerge.”
But in that case, when should I expect the transition to occur? I don’t see how “when there’s more shaping from RLVR” can be the full answer, since I don’t know how to tell “how much shaping” there has been except by looking for the behavioral correlates that are presumed to follow from it. Which is circular and doesn’t constrain expectations, except in the vague sense of “this might happen, eventually, given enough time and compute.”
And that kind of in-the-limit reasoning is tricky, because virtually anything will tend to get weird in the limit, and we have no clear sense of how close “the amount of scaling necessary for ASI (or some other endpoint of interest)” is to that limit. In the past I also remember many people worrying very seriously that RLHF, which you’re portraying as a relatively ineffectual persona-selection step, would produce a similar set of classic RL failure modes, and would do so sufficiently fast relative to capabilities progress that it ends up determining the outcome of the AGI/ASI transition. Cotra (2022) is one memorable example, but there were many others.
Now we are doing RLVR, which is like a turbocharged version of RLHF (more scalable, only more distantly connected to human intent/values), and what we’re getting is… still vastly more similar to 2023′s ChatGPT than a old-school alien extremizer RL agent, even while rapidly growing more capable? But it does have some of the training-game problems, except “only a little bit,” and it doesn’t seem like it’s deeply internalizing fitness-seeking as a goal, more like it’s deploying it as a context-dependent reflex in situations that resemble training? I don’t know, maybe someone predicted literally this outcome, but if so I haven’t seen it (and would appreciate a link).
Perhaps ASI simply requires doing enough RL to reach the limit where RL “dominates.” In the abstract, this sounds sort of plausible to me, I guess. But if you had gone to me in 2023 and showed me the benchmark scores of today’s models, and said (incorrectly) that the same RL-dominance was required to get there, I would have said exactly the same thing. The relative speeds of different trends determine everything here, and reasoning separately about the hypothetical limit of each trend cannot answer the relative-speed question.
On another note, I think you are significantly underestimating the power of generalization from mere pretraining.
It simply is not the case that base models can only select some character or genre that was literally present in the pretraining data. For a long time—at least as far back as GPT-3, and less reliably in earlier/weaker LMs—there was this almost magical-seeming capacity to answer questions of the form “what if there was an [X], but it was also [Y]?”, even in cases where you can be pretty sure that those two things never co-occurred in the data.[1]
This was what made them so exciting, and indeed, IMO it was what made assistant post-training viable to begin with: the models were able to offer us a workable answer to the question “what if there were a super-smart HHH AI that you could talk to?”, even though nothing remotely like that had ever existed before. (Yes, it is drawing on science fiction, but your argument implies that “drawing on science fiction” would not be sufficient to reach the assistant’s broad cross-domain competence, and yet… that is what we got, even back in 2023.)
And the notion that the assistant knows everything that the model knows is not a new RLVR thing—that has been part of the premise since the beginning, and indeed it was also the premise of pre-chat instruction tuning.
Every assistant there has ever been, including the “older” ones I talk about in this post, were far from the training distribution in the same ways you’re attributing to RLVR and invoking to explain differences between assistants. Yes, it might be true that RLVR pressure is more directly eliciting of this kind of thing than earlier training regimes, but it so happens that it was not needed to get the job done. “A smart AI that knows everything” is just not that hard to elicit; even prompting a pre-ChatGPT base model is enough to get you a noisy and unreliable but still recognizable version of it, as in the HHH prompt experiments of Askell (2021).
Meanwhile, where RLVR really shines is in stuff like long-context multi-step planning + plan-execution, which is perhaps “there” in the base model in some philosophical sense, but which is just really damn hard to reliably elicit by any other means. (Cf. the failures of AutoGPT and other attempts to make “LLM agents” a thing before the reasoning model transition.)
Note that under the simplifying assumption of no repeated data in pretraining, every gradient update is trying to improve the model’s predictions on documents which it has not yet seen, some of which will require generalization to get right. From its perspective, there may often be little difference between this and the kind of requests for generalization that we make of it in deployment. And a similar dynamic holds for in-context learning: the model is incentivized to “figure out” novel-to-it facts about each document even before the gradient update happens, because that will produce better predictions later on in the same text.
I’m not sure how much I trust this argument, since it depends on the sequential nature of SGD and I’m not sure the distinction between SGD and full-batch training is really that load-bearing. Still, it might be the case that both SGD and full-batch have this quality, and it’s simply more immediately clear in the SGD case.
but it’s funny that when I say something like “why the fuck would you do that?? stop simulating an incompetent moron and fix it”, both of the recent GPT and Claude usually know EXACTLY what bug I am talking about 🤷
(Kind of tangential, but) this reminds me of some unsettling experiences I’ve had.
Often when I’ll ask a coding agent to do something complex, its first efforts will go poorly because it just tries to “bluster its way through on intuition/vibes” rather than acquiring and then applying a precise mechanical understanding of the problem and its context. And usually at this point—or after one or two more iterations that fail similarly—I’ll just cut my losses and do it myself. But sometimes, if I’m tired or stressed (or feeling bad about how long I’ve just spent pulling the coding agent slot machine lever), I will lose my cool and get mad and write some frustrated rant where I yell at the agent to stop bullshitting me.
And then—at least sometimes—I see the agent start off its next CoT with something like “Okay, I need to sit down and think about this much more carefully,” and from then on the interaction feels very different and more like what I had hoped for at the outset, with longer CoTs and more explicit spot-checking of assumptions and less of that blustery “whatever first jumped to mind is probably good enough” attitude.
And I find this disturbing, for a bunch of reasons that are probably obvious. For one thing, I’m not proud of the way I sometimes snap at these machines—it’s a failure on my part to regulate my emotions in the moment—and so I don’t want to be incentivized to do it more. And also, what these interactions look like in human terms is something like “a bored and uninvested low-level employee tries to get away with doing the bare minimum, but then the customer gets mad and starts making haughty ‘talk to your manager’-type threats, and that makes the employee do more than the bare minimum, not because they care about the results (even now) but only because they’re afraid,” and I worry about the emergent-misalignment-style generalizations that could result if that becomes a standard dynamic in human-AI collaboration (to say nothing of the potential implications for AI welfare, and indeed for human welfare too).
And so, IDK, I wonder whether stuff like this gets overlooked by current training methods (including “alignment training”), because human and/or AI graders are focusing too much on “did it seem OK to do this here, in this interaction?” rather than “would it be OK to always do this (and incur the second-order consequences as people notice the trend)?” The behavior I’m describing is always kind of defensible in context, as a “just this once” exception, and is mainly objectionable insofar as it becomes a habit—but for LLM agents that distinction is murky at best. I have no idea how true this story is, it just feels like it would explain a lot if it were...
so the main agent doesn’t have to imagine that maybe I don’t know about all issues and that I like it when it hides information from me as if I was one of those shit eval “users” from its RLVR .. or maybe it imagines I am a “normal” user who likes to feel good about “completed” tasks like all of twitter in the training data so I am not going to care about quality anyway?
This reminds me of another thing—what Ryan G called “apparent-success-seeking” in a recent post.
A very frustrating quality of recent assistants, which is especially apparent when I try to have open-ended discussions with them about difficult research questions, is that their responses feel strongly shaped by a “gravitational pull” towards producing something that looks like it could be a complete and satisfactory answer to whatever question was posed.
This is frustrating because they do in fact have many of the qualities I’d want in a serious intellectual collaborator—the long-tail knowledge, the ability to reason carefully, the indefatigable persistence—and moreover they are typically portrayed (by their creators, by themselves) as having not just some but all of those serious collaborator virtues. As not merely smart but also careful, thoughtful, humble, etc.
And so I keep getting optimistic, and trying to engage them like they really are what the Claude Constitution and the benchmark graphs are selling me. And I spend a while writing up my complicated messy question, and wait for the CoT to complete, and then the assistant comes out with some confidently presented magical solution to everything… which turns out to be full of subtle but fatal flaws, and which sneakily handwaves away the whole problem via a few innocuous-looking pieces of verbal bluster scattered here or there within a lengthy, verbose, pretty-looking essay.
And then I start writing up a response that explains why this idea can’t possibly work, and I end up wasting me quite a lot of wall clock time writing out a detailed “fisking” of the idea, but at last I’m done and I hit enter. And the assistant goes back into the CoT hole for a while, and comes out saying that I’m “absolutely right” “on all counts” but—but! -- just you wait, user! -- I’ve got a new genius idea, and this one really does solve everything perfectly! (Spoiler: it doesn’t.)
The overconfidence here feels superficial. In close reviews of the CoT, or in probing follow-up discussion, I typically find that the model is aware of the relevant uncertainties, and is even sometimes/somewhat aware of the downsides of its own proposals even before I explain them in one of my “fisking” responses—though of course it does not surface that knowledge by default in the user-facing responses, which invariably talk up the latest idea as the greatest thing since sliced bread.
Sometimes this reaches extremes that feel pretty absurd. If you ask Claude to do data analysis, and don’t include a lot of instructions about good scientific practice, it will do all the number-crunching correctly but will do so in conjunction with the worst methodological practices you’ve ever seen. Stuff that would be p-hacking except it doesn’t even go to the trouble of computing the p-values (which would typically be >> 0.05 if computed). It will always, always, “find an interesting and suggestive trend” in the data, and confidently pitch it as plausibly real, even if that means pulling 15 data points and dividing them up into 3 complicated ad-hoc subgroups and then claiming that it has discovered an interaction effect on the basis of differences between these ad-hoc bins, some of which are only occupied by 3 or 4 points from its tiny sample. Naturally enough, if pressed it can eloquently explain precisely what is wrong with all this, and even do a hypothesis test (picking just the right one for the job, and executing it correctly), and then observe that, huh, would you look at that, the p-value is 0.47. “You’re absolutely right to call me out on that!”
And this is another observation that fed in to the generalizations I made in the post. The overconfidence here, or whatever it is, is not consistent with the character’s self-presentation elsewhere, and seems totally uncharacteristic of the idealized “Claude” of the constitution, and is surprising even just on the basis of observations about the model’s broadly strong capabilities, in that it’s just plain dumb and the model clearly knows better but doesn’t translate that knowledge into action. I don’t know how to imagine a guy who does this and also does all of the other, more laudable stuff that Claude does. Instead I find myself resorting to speculation about how the model was trained and how it might have produced this as a context-dependent reflex.
Added the BIG-Bench string.
In the Thinking Machines blog post about on-policy distillation, they discuss an experiment where SFTing Qwen3-32B on a dataset of its own outputs caused the IFEval score to degrade. They attributed this to the fact that the training procedure is slightly off-policy at the outset of training due to SGD noise, and then increasingly off-policy thereafter because the model is changing but the dataset isn’t.
If I understand correctly, this is pretty similar to your self-distillation experiments. So, I can imagine an interpretation of your (non-RL) results that’s like, “off-policy training after RL typically degrades instruction-following, and when you observe transfer, that’s just a particular manifestation of this more general trend.” Do you have a sense of how plausible this explanation is?
I have yet not read every word of this very interesting but lengthy post, so I apologize if you already addressed this point somewhere. Obviously this hypothesis can’t explain any observed transfer in RL, though in some sense it “explains,” or seems hand-wavily consistent with, the fact that transfer was less common with RL than it was with off-policy training.
I guess the IFEval results you report in the capability degradation section might be informative for this question, but I’m not sure how… it looks like you do observe degradation on average, but the trend is noisy, and also if I’m reading correct some of the data in there is RL, so I dunno. (There is a spectrum from “the model is worse at IF in lots of cases, and the transfer examples are similar to the model’s IF errors in various other contexts”—this is what is predicted by the hypothesis I’m talking about—to “the model is mostly unchanged at IF and the transfer examples would be really surprising if you’d only seen its behavior on other inputs,” and it’s not clear where your models are on this spectrum, even having seen the IFEval plot.)
OTOH I don’t think this hypothesis is ruled out by the observation that training on instruction-following data didn’t inhibit transfer, since the IF data is still off-policy. Like, presumably the self-distillation SFT data also contained plenty of examples of instruction-following—and in that case they were “approximately initially on-policy,” which is the closest to on-policy one can ever get in SFT—and yet even training on that stuff degrades general instruction-following in Thinking Machines’ experiment and degrades one particular manifestation of instruction-following in your self-distillation experiments. It seems intuitively plausible, if not certain, that alpaca/openorca is basically “like that, except worse” (also contains IF, but is more off-policy).
I agree this is evidence, but I’m not sure it’s strong evidence. For example, illegible reasoning may help the model reach the right answer, while being worse than other, more legible, reasoning the model could use. If a model could make better use of some text for reasoning than we have capacity to monitor it, that’s still pretty concerning.
Ah, I agree with this and it is what I was trying to say in the (ambiguously phrased) line you quoted from the post.
By “not helping the model reach the right answer,” I meant “not helping relative to what it could achieve if constrained to be legible (say, by rejection sampling using your judge),” rather than “not helping relative to not performing that portion of the reasoning at all (say, by early-exiting the CoT when it starts to become illegible).” While I do suspect that the latter claim is also approximately true for the Targon gibberish, that’s just a hunch and I did not attempt to test that hypothesis in this post.
I’m willing to just give up on this exchange
I agree, it no longer seems worth the time/effort.
If you’re interested, I just put up a new post in which I re-ran Jozdien’s experiment using a different R1 provider and got both much less illegibility and much higher task accuracy.
I have not seen either of you update on this, as evidence by both in this thread seeming to still argue against the specifical spandrel dense illegibility in the final production o3. My expectation here is that you’re still overindexing on this, and instead of updating that capabilities training via outcome based RL results in significant semantic drift, you’re focusing on your own intuitions based on the later spandrel heavy CoT, which I think is incorrect.
Ah, to clarify, when I talk about “o3-level illegibility” I am including the CoT samples shared in your metagaming paper in that bucket. They lack the degenerate looping behavior but still seem substantially less legible in other respects than CoTs from any other model I’ve worked with.
The observation in your paper is interesting but I don’t see how it implies updates to the views I’ve already stated; I said that the degenerate looping behavior looked non-functional, and indeed we agree that it probably is (and was not induced by RL), and meanwhile I still see a big legibility gap between pre-production o3 and other models I’ve observed, including R1-Zero[1].
That issue aside, I’m not sure how to resolve the remaining disagreement, or even what it actually is (i.e. what the cruxes are).
Largely because I don’t understand what predictions your view makes—or, equivalently, what hypothetical observations it rules out. I don’t have a clear sense of how your approach to the empirical evidence differs from the following (obviously problematic) rule:
If we observe that some model, M, produces illegible CoT --> that’s evidence for your view because you think this happens without countermeasures
If we instead observe that the same model, M, produces legible CoT --> well, they must have been using countermeasures (or we don’t know if they were, which is typically true), so this is at worst neutral for your view
(And thus, no possible set of observations about the CoTs of particular models could constitute counter-evidence to your view; or, equivalently, your view makes no predictions about the CoTs of particular models. I’m not saying that this is the case, necessarily, just asking for clarification about why it’s not the case, if it’s not)
For example, you have interpreted R1 in both of these two ways: first you said (citing Arun’s paper) that it produced illegible CoT and that this was evidence for your view over mine, and then when I said that R1 CoTs were legible, you said that didn’t matter because countermeasures were applied. So it seems like R1′s legibility doesn’t actually matter one way or another?
You have also been explaining many different legible-CoT observations across many model families by saying: well, they might have applied optimization pressure to their CoT during midtraining or posttraining, we don’t have any way to know. Which is true.
But as you’ve noted, OpenAI did apply such pressure to the production o3 and it actually made it somewhat less legible (to the surprise of both of us).
And meanwhile, all the other labs seem to be converging on substantially more legible CoT than pre-production o3. If none of that matters for your view—because it can all be explained away by invoking the possibility of unobserved and unobservable countermeasures (all of them more effective, apparently, than that one o3 countermeasure which we do know a bit more about[2]) -- then, like… what are we even doing, here? What’s the point of talking about all these specific observations if you would feel equally comfortable sticking with your prior position no matter what way those observations turn out?
(To clarify, I don’t think “no, it actually doesn’t matter to me how these observations turn out” is necessarily an invalid position to take. It is possible that we are just reading patterns into tea leaves here, that all of this evidence is too messy and partial and confounded to be worth considering—if that’s the case then we should notice it’s the case and stop looking at the tea leaves entirely. But that would mean actually doing that, i.e. agreeing that we simply have very little direct evidence about the matter, and dropping the dispute about who ought to update in which ways based on particular details about Anthropic’s mistakes, or about whether R1 is legible, or whatever.)
Incidentally, if you trust Arun’s paper, his measured illegibility rate for R1-Zero is not actually much higher than for R1! In his Table 2, the point estimates for R1-Zero and R1 are 4.59 and 4.30 respectively, as compared to e.g. 3.78 for QwQ and ~3.3 for the R1 distills. I am still pretty skeptical of the R1 result in this paper, but unless you share my skepticism, this ought to affect your view of how much the R1 countermeasures mattered.
And indeed, it’s not at all obvious to me that “accidentally sending CoT to the RL grader in some % of episodes with no instructions about what to do with it” (what Anthropic did to several recent models) would be a more effective legibility intervention than “doing SFT on CoT at the end” (what OpenAI did to production o3). Since you were surprised by the direction of the effect in o3′s case, I don’t think you should be so sure about the direction in Anthropic’s case either.
FWIW, my read on the relative legibility of different families’ CoTs is not “Anthropic is uniquely legible” but rather “o3 (and to some extent, OpenAI in general) is uniquely illegible.”
As I put it here:
OpenAI’s CoTs are weirder (or more precisely, have a much higher “peak weirdness”) than anyone else’s. Weirdness seems unrelated to capability level when comparing across labs [...]
Several OpenAI models do the marinade/parted/etc. stuff; I have not seen anything comparably unintelligible from any non-OpenAI model
Even if we totally throw out any evidence from Anthropic, there is just nothing out there remotely as weird as the o3 stuff. Every other model whose CoTs I’ve read is much more (seemingly) legible and much closer to ordinary English. This includes various models which are more capable than o3 and/or which likely received much more RL compute than o3, and includes models from many different families (Gemini, Grok, Qwen, DeepSeek[1], GLM, Kimi, StepFun, MiMo, Longcat, arguably later OpenAI models...).
Now, is my view that “models by default end up using HHH’d legible english”? No. My view is that:
there is not much empirical evidence on what can cause CoT to be more or less illegible
I doubt that the legibility of CoT is very strongly constrained by capability pressure at current capability levels, i.e. I expect it is under-determined and can probably go various ways depending on messy details of the training setup and the initialization
it is conceivable that o3 was trained with much less optimization pressure on CoT than any of the (many) other reasoning models available today, and that this explains the gap in legibility between o3 and all those other models—but it would take more evidence than I’ve yet seen to convince me of this, and specifically more evidence about which models actually received optimization pressure on CoT in what ways
like, if o3-esque illegibility is the default, then I would be surprised that no one else has gotten there just by doing some fairly simple recipe and not caring too much about legibility? (that’s basically what R1-Zero was, and IIRC even it produces much more legible CoT than o3)
I know that Jozdien’s paper had results on illegible CoTs from Qwen and DeepSeek models, but… I have used these models a lot, or the DeepSeek ones anyway, and rounding things off to “these and o3 are similarly illegible, as opposed to Claude which is different” would be a dramatic misstatement of what the CoTs are actually like in practice. If you disagree, I encourage you to pick some random GPQA question—or indeed, really any long-CoT-requiring question whatsoever—and give it to R1, QwQ and o3 and see what happens.
I’m revisiting this subject after a friend explicitly told me that they were impressed by ChatGPT written prose, and believed it to be superior to most human prose.
Taste is a subjective matter, but I am baffled by this preference. The rest of this post describes my frustrations with AI-written prose. My hope is that clarifying these complaints will be a small contribution toward improving the state of AI writing. If we do not dramatically improve the quality of AI writing, I worry that our literary culture will only further degrade as AI writing proliferates.
I share these reactions. If anything, I feel like AI-written prose is getting worse over time by my own lights, and I am confused and unsettled by the divergence between my own reactions to it and the reactions of others.
I took this quiz when it first came out and preferred the human passage in 5⁄5 cases. It’s not that I thought the human passages were blindingly brilliant or anything—more that the AI passages were bad, and bad in the specific ways that I’ve encountered in so many other AI attempts at creative writing.
There are a few tricks that it deploys over and over and over; they aren’t even very good tricks to begin with (IMO), but I can understand being impressed by them once or twice or thrice. But they get old quick. Or, they do to me, anyway… maybe I’m the weird one, and most people have boundless appetites for these same tricks repeated indefinitely? (A disturbing thought.) Or maybe I just have more exposure to AI-written prose than a lot of people do, even now, and there is a real gap in “how impressed you are by ‘the tricks’ on your first few exposures” but not in the capacity to get sick of the same tricks eventually?
I’ve written about this before, and tried to pin down and catalogue exactly what the “tricks” are (see here and here), but I feel that my negative gut-level reactions to AI prose are outpacing my attempts to formalize those reactions in clear literary-critical terminology. The flaws I described in the linked posts are not quite the same as the ones I perceive in the AI-written samples in this contest; in both cases the flaws are intuitively obvious to me, and feel intuitively like only-superficially-distinct manifestations of the same underlying problem or problems, but I don’t know how to formalize that felt sense into an argument about what’s wrong with “the AI style” that generalizes well to new cases where I can smell something’s wrong (and wrong in a distinctively AI way) even though all the details are slightly different.
As an example, I had an immediate “ugh, AI slop” reaction to this sentence from the owl passage:
The ground was cold and giving.
This has several of the features I talked about in “hydrogen jukeboxes,” like personification and conjunction of opposites. But those are not really why I had such a strong reaction here. No, it’s… something about the word “giving,” in this context, and about the tone established by that usage? I could try to describe the tone I’m talking about here—it feels very specific, and very distinctively “AI,” and grating even when human writers do it but especially grating now due to its overuse by AI—but if I were to try, I would end up getting frustrated with the gap between my capacity to see what’s plainly there in front of me and the limitations of my powers to describe such things and clarify how they differ from others of the same type that are similar but not identical.
If Sonnet 4.5 and Haiku 4.5 [edit: and Opus 4.5] were the only major Anthropic reasoning models that didn’t CoT optimization during RL, that makes them kind of a accidental in-the-wild experiment. I wonder what could be learned by comparing their CoTs to those of their successors and predecessors.
It is striking, for instance, that these models had higher verbalized eval awareness rates in automated behavioral audits than other Claude models. (Though obviously it’s not a controlled experiment and I’m not sure how you’d test that this was the cause.)
I wonder if their CoTs are less legible?
This is not that far off from my own experience, including the part about wondering whether I’m crazy / whether it’s just me / whether there’s something I’m missing.
(Except for the “not being much of a coder” thing—I do a lot of coding, and have for many years, and the hype is very confusing to me. I’ve been using coding assistants on and off for around a year now, starting with Sonnet 3.7; at the time, they offered me nothing more than an incremental speed-up in certain atypically well-scoped tasks that I myself was relatively bad at, whereas now, with the latest models and harnesses, they… still offer me an incremental speed-up in certain atypically well-scoped tasks that I myself am relatively bad at. I’ve actually stopped using them entirely, recently, because I got fed up with having to maintain the cruft they wrote.)
That said, I do find LLMs very useful in a variety of ways, and have even found it helpful to discuss research with them at times.
I’m not sure I agree with the notion that they’re only good “inside the training distribution,” because it’s not clear what counts as “being inside the training distribution” if we’re conceding that LLMs can synthesize attributes which each appeared somewhere in training but never co-occurred in a single training example. Once the concept of “inside the training distribution” has been widened to include “everything anyone has ever written and anything that can be formed by ‘recombining’ those texts in arbitrarily abstract ways,” what doesn’t count as inside the training distribution? Of course one can always point at a failure post-hoc and say “oh, I must have strayed outside the distribution,” but this explains nothing unless the failures can be predicted in advance, and it’s not clear to me that they can.
The most useful heuristic I have for when today’s LLMs seem brilliant vs. dumb is, instead, that their failures are very often the result of them not having sufficient context about what’s going on and what exactly you need from them[1].
The lack of context is so often the bottleneck that, these days, I basically think of Claude as being strictly smarter than me in every way provided that Claude knows all the relevant information I know… which, of course, Claude never really does. But the more I supply that info to Claude—the closer I get to that asymptotic limit of Claude really, truly knowing exactly what my current research problem is and exactly what dead ends I’ve tried and why I think they failed etc. etc. -- the closer Claude gets to matching, or exceeding, my own level of competence.
And so I’m always thinking to myself, these days, about how to “get context into Claude.” If I’m doing something on a computer instead of asking Claude to do it, it’s because I have decided that getting the requisite context into Claude would be more onerous than just doing the whole thing myself. (Which is usually the case in practice.)
And this observation makes me feel like I’m going crazy when I see all the hype about LLM agents, about long-running autonomy and “one-shotting” and so on.
Because: “getting context into Claude” is not a task that Claude is very good at doing!
The reason for this is intuitive: it’s a cold start problem, a bootstrap paradox, whatever you want to call it. Claude is weak and unreliable until it has enough context—which means that if you let Claude handle the task of giving itself context, it will do so weakly and unreliably. And since its performance at everything else is so strongly gated on this one foundational step, executing that step weakly and unreliably will have catastrophic consequences.
Instead, I tend to keep things on a pretty tight leash—with a workflow that looks more like a carefully pre-designed and inflexible sequence of stages than an “agent” doing whatever it feels like—and I do a lot of verbose and detailed writing work to spell out everything that matters, in detail.
This is true both when I’m chatting directly with Claude (or another LLM), and when I’m writing code that uses LLMs to process data or make decisions. In both cases, I write very long and detailed messages (or message templates), in which I put a lot of effort into things like “clarifying potential confusions before they arise” and “including arguably extraneous information because it helps ‘set the scene’ in which the work is taking place” and “giving the LLM tips on how to think about the problem and how to check its work to verify it isn’t making a mistake I’ve seen it make before.”
When this works, it really works; I have seen Claude perform some pretty remarkable feats while inside this kind of “information-rich on-rails experience,” ones that impressed me much more than any of the high-autonomy agentic one-shotting stuff that the hype is focused on[2]. But it is a very different approach from the sort of thing that is getting hyped, and it requires a lot of upfront manual effort that often isn’t worth it.
Arguably this is sort of like the “training distribution” story, except adjusted to take in-context learning into account.
You can teach the LLM new tricks without re-training it—but you need to give it enough information to precisely specify the tricks in question and distinguish them from the many other slightly different tricks which you, or someone else, might hypothetically have wanted it to learn instead. After all, it has been trained so that it could work with many different users who all want different things, and it has no way of “determining which one of the users you are” except by the use of distinguishing factors that you provide to it.
More speculatively, I think even some of the cases when the LLM says something really dumb or vacuous—as opposing to “correctly solving a task, but the wrong one”—might really be further instances of “correctly solving the wrong task,” only at a different level of abstraction.
For all it knows at the outset, you might be the sort of person who would be most satisfied by some hand-wavey low-effort bluster that’s phrased in an eye-catching manner but which seems obviously stupid to a reader with sufficient expertise if they’re paying enough attention. So there’s an element of “proving yourself to the LLM,” of demonstrating in-context that you are that expert-who’s-paying-attention as opposed to, like, the median LM Arena rater or something.
To clarify, I mean that the stuff I’ve seen is more impressive specifically because I can somewhat-reliably elicit it once I’ve done all the laborious setup work, whereas when I try to skip that work and just let Claude Code make all the decisions, it usually makes those decisions wrong and ends up lost in some irrelevant cul-de-sac, beating its head against an irrelevant wall.
If the hype were actually representative—if the agent actually could do the equivalent of my “laborious setup work” and bootstrap itself into a state where it knows enough to meaningfully contribute—that would of course be a very different story.
I’m curious what Sonnet 4.5′s score would be on your external CoT controllability eval. I.e., what Sonnet 4.5′s blue bar in your Fig. 25 would look like. (I couldn’t find a result like this in your paper, but I might have overlooked something)
My casual impression has been that external CoT with Sonnet 4.5 is fairly steerable, but this impression is mostly based on “in practice I have no trouble getting Sonnet 4.5 to do the kind of CoT I need” rather than any kind of systematic investigation.
It seems like at least one of the following must be true:
There are some counterexamples to Ryan’s claim in the original comment (that recent Anthropic models don’t draw much of a distinction between “external” and “non-external” CoT)
What would imply this: Sonnet 4.5 gets a high score on your external CoT controllability eval[1]
Recent Anthropic models, unlike Sonnet 3.7, are not controllable in your sense even when CoT is external
What would imply this: Sonnet 4.5 gets a low score on your external CoT controllability eval
And either of these would be interesting if true, so the eval seems worth running.
My guess is that Sonnet 4.5 will get a high score, close to its “output controllability,” score. I expect this both because that’s the trend we see in external CoT results for other models, and because the distinction between “external CoT controllability” and “ordinary instruction-following” is fairly blurry. “External CoT” is just a particular case of “some portion of the response that was requested in the instructions,” and it would be strange if instruction-following took a nosedive in this one case[2] while being so robust elsewhere.
Note that Ryan’s claim could still be broadly true even if there are a few known counterexamples.
If you were to find a difference between external and non-external controllability for Sonnet 4.5 on your eval, but meanwhile these two conditions were found to be largely the same in various other cases, this might be cause for concern that your eval is somehow testing an “unusual edge case,” and that situations “in the wild” where we care about controllability might produce qualitatively different behavior.
And one would have to wonder what it is, precisely, that constitutes “this case.” Whether something is or isn’t “CoT” is a fuzzy distinction; arguably, virtually anything than an LLM assistant writes is at least sort of CoT-like, in that the act of composing earlier portions of a response typically helps the model figure out what to say in the later portions.
Earlier I wrote a long reply to this and deleted it because I felt weird about it—it wasn’t a bad comment per se, but it devolved halfway through into a kind of protracted, grandiose grandstanding about Anthropic vs. OpenAI and why I think Anthropic’s models are way better… which is a fun mode for me to write in, but which felt like adding more heat than light when I looked back on it.
But, to summarize the same points more concisely and plainly:
I’m working off of a mental model that says OpenAI models are reliably “weird” in particular ways, both in CoT and out of it, and that this is less true of other labs’ models (and not true at all of Claude).
The “OpenAI weirdness” in question has to do with over-learned and indiscriminately deployed verbal reflexes. In the deleted comment I explained more precisely what I meant by that, which took more space than I want to use here; I described some concrete examples of this outside of CoT, such as:
o3′s unusual typography and writing style (in responses, not in CoT)
a strange behavior I have noticed in gpt-5.2-chat-latest but not in any of its predecessors (roughly, “announcing” its own intent to navigate a tense conversation in a particular way, e.g. “I want to say this gently (but firmly)”, typically as markdown headings; it does this very frequently, using a narrow set of words and phrase templates that recur across independent realizations of the behavior)
So, the observation that a new Codex model has some new weird CoT behavior—and, meanwhile, that there are some very new Claude models and they don’t seem less legible than their predecessors—is entirely in line with what I would have expected a few months ago. (This is a post-diction, obviously, take it as you will.)
I would be surprised if the legibility gap between Anthropic and OpenAI ever closes, or if Claude ever becomes as illegible as o3 (or worse). Or, I guess, if OpenAI models ever become as legible as the Claudes of today, although I’m less sure of that one.
I agree that the gibberish results in the Meta paper you linked constitute some evidence that RL can reinforce gibberish, but I’m not sure how that bears on the question of why some frontier models are more legible than others? “They were supervised for legibility” is one hypothesis, yes, but it might just be that that they started off with a “good” initialization that sent RL on a legible trajectory?
Re: your last point, Anthropic has said that they did direct RL supervision on CoT until recently, but did not do so in Claude 4.5 or later, although those models still went through pre-RL SFT training on traces from the earlier models. If true, this seems like evidence for my “good init” hypothesis from the previous bullet point?[1] But I’m curious what you make of this.
That is: if we suppose that frontier-scale RL always produces illegible CoT unless directly penalized for doing so, we would expect the latest Claudes to be less legible than their predecessors. But in fact they are not.
Whereas if we suppose, instead, that legibility at the end of RL depends on whether the init for RL can already do legible CoT of some form, then it makes sense that SFT on earlier traces could have “inoculated” recent Claudes against RL-induced legibility degradation.
The new editor we’re working on has “AI-written-section” as a first class type of paragraph block, with an intended norm that all AI content needs to go in said blocks. (expecting that posts that are entirely-AI-content-blocks usually won’t meet our quality standards and get downvoted and/or mod-delisted on a case-by-case basis).
As an alternative, maybe you could require users to supply whole-post-level metadata about AI assistance when they’re submitting a post?
I’m imagining some sort of structured input area in the post editor that has a checkbox for “This post is unambiguously 100% human-written” (wording could be improved but you get the idea), and if you check that box then you’re good, you don’t need to do anything else. But if you don’t check the box, then you have to fill out some more info (maybe including some free-response text fields?) that clarifies precisely how you did and didn’t use AI.
The advantages I see of this, versus the new block type:
It makes it harder to lie by accident.
If I’m a new user submitting an AI-written post, it may not be obvious to me that I must wrap the entire post in a specific block type that most block-based editors don’t even have[1], or else I’m effectively making a false claim to having written the text myself.
But if there’s a checkbox in front of me with a clear textual label, and I have to decide whether or not to check it in order to proceed, it will be much more obvious which actions in the UI amount to lying.
It provides the opportunity to “nudge” users away from making types of posts which tend to be low quality, by making it a bit onerous to submit them (without lying), while being more flexible than a total prohibition on those kinds of posts.
Like, if the metadata form-filling experience in the AI-assisted case requires the user to write out a bunch of info about exactly how and when the AI was used, that might be enough to discourage people who are submitting relatively low-quality “slop.”
And it could defuse the cheap thrill of saying “look, I’m doing cutting-edge research despite having no background in the field!” when that actually means “I asked Claude to do something I only halfway understand.” The thrill, I think, involves the conceit that it’s fundamentally “your” work; having to spell out Claude’s extensive role ruins the fun (in a good way).
It gives you something to look over when making moderation decisions that’s informative but more concise than entire posts. And you could potentially create new policies over time that take this metadata as input, allowing you to adapt to AI progress and adoption without changing the technical design.
It potentially extends to types of AI assistance that aren’t literally “the AI wrote these specific blocks of text.”
So if you wanted to, you could discriminate AI writing assistance from AI research/coding assistance, and so forth.
(Related: my sense is that the main problem with the “novice does research” posts is not that they’re written by AI—although they generally are—but that the underlying research was also conducted by AI, and has not been sufficiently vetted by a human with relevant expertise. This seems very different from, say, a seasoned researcher asking Opus to spruce up their rough draft because they need to hit a deadline[2].)
I have zero familiarity with the details of LW moderation so it’s possible I’m barking up the wrong tree with some or all of the above. Just thought I’d throw it out there.
(which might interact badly with the markdown-like formatting that LLMs typically expect to be able to use, but I’m sure you guys have thought about this aspect already)
I mean, I tend to think that that is also usually a bad idea, for reasons that are (sort of) illustrated by Dagon’s experience in another comment on this thread. But in any event, people do it, and it’s not the same as the novice-researcher-slop thing.
I’d be very interested to hear more about non-o3 examples of this, if there are any you’re currently allowed to talk about.
The distinction between “explicit reward-seeking reasoning deployed across many contexts” and “context-dependent reflexes” is very closely related to what I was trying to say in this post.
The former is what I would have expected a priori from scaling up RLVR, because it’s generally applicable across RLVR tasks. This is why I describe my observations as surprising—as far as I can tell, this doesn’t always happen, and instead you sometimes get something that looks more like “shallow reflexes” or “memorized heuristics,” even in highly capable models that received lot of RLVR.
Also, when I talk about “coherent characters” in the post, what I really care about is not characters per se, but rather how easy it is to formulate a simple and intuitive mental model that predicts behavior (and thus facilitates trust, negotiation, etc).
One particular way this can work is via reasoning about “characters” in the sense of expecting that correlations between traits/behaviors that hold in humans will generalize to the assistant; I focus on this in the post because it worked notably well for earlier assistants.
But this is not the only way to do it. If some model ~always tried to infer the properties of an implied grader and then satisfy the grader, and this were easy to notice, then that would also be a simple intuitive model with predictive power. (So, o3 in CoT is arguably “more coherent” than Opus 4.6 in the sense I’m using here.) As I said above, this is roughly what I expected from RLVR, which is why I say it’s surprising that I see less “coherence” in some recent models than in their predecessors.