At Apollo Research working on scheming.
Papers:
Sure, but even given uncertainty I still think some update is warranted.
They do include some samples, included + discussion here: https://www.lesswrong.com/posts/wCSEpT3dTGz4N86Wi/even-illegible-mythos-reasoning-traces-seem-pretty-legible
Overall I agree with Leo’s top comment here:
obviously the cot semantic drift will initially be similar to language. but it will only get less legible from here. and this is already pretty bad? i wouldn’t be surprised if two different people tasked with independently deciphering this would come to very different conclusions.
Obviously it’d be great if all labs provided way more data on this, environments, samples, etc as well as robust measures of frequencies and all. We did this for o3 including rates compared to 2017 web text, rates of unusual terms on capabilities evals, changes over capabilities training etc, but it’d be great to see Anthropic or GDM do similar investigations.
If you’re just saying that by default some kind of semantic drift happens (but we can still understand it / it isn’t illegible), then I’m on board, but I also don’t see why it’s a big deal.
This:
The meaning of “genuinely” when Claude uses the word is different from the meaning when humans use the word.
is not what I interpret anyone to mean when talking about illegibility and semantic drift. The case I’d be worried about is primarily that various safety relevant properties like “is the model reasoning about reward” are harder to get legible evidence of, ex: (the plots are from before the degenerate repetitions get introduced)
I know you don’t consider the OpenAI examples evidence, but I’m curious what you make of the reported illegibility results in Fable / Mythos. I would suspect that the “cards” example they show is an extreme case, but their classification seems to show a significant increase in frequency of illegible reasoning (however their recursive summarization classifies illegibility).
Overall I’m pretty suspicious if OpenAI and Anthropic are seeing increases in illegibility but Gemini never exhibits this. (It’s entirely possible, like it seems feasible that for current models interventions like resampling for legibility or other replacements of terminology / reasoning currently give you similar performance without incentivizing obfuscation too much, but I just mean that you’d want to be intentional about the intervention or understand why this isn’t a problem for Gemini)
I’d note that this same phenomenon happens every time any CoT is released, where people claim they can interpret it (or an LLM interprets it zero shot). It’d be interesting to have a benchmark for this.
Can you summarize the pieces of evidence that lead you to think that by default CoT becomes illegible with currently-typical amounts of outcome RL?
I’d disagree with this characterization, as:
Fair warning, I have similar priors as 1a3orn and nostalgebraist.
Almost always intend this to mean something like:
“I’m claiming OpenAI’s o3 level of illegibility is the natural outcome of current levels of RL across all models, and if you don’t get that outcome you are somehow optimizing the CoT to look nice”
I’m not claiming that. My specific claim was the much weaker:
Overall my suspicion is that by default you get semantic drift at minimum (and compression given everyone uses various forms of length penalties, but the semantic drift is the harder case as far as legibility).
In general, the point I’m continually arguing against is primarily that you get HHH’d English CoT without semantic drift by default.
I think the fact that I called out my suspicion that Anthropic did in fact meaningfully (it turns out unintentionally) optimize against the CoT, and that this turned out to be correct, should in fact count as evidence.
We also now have Mythos, which increased significantly in what Anthropic categorizes as “illegible thinking” between Mythos Preview and the final Mythos.
It’d be interesting to know more about this:
I’d want to see that this was true for other models too.
I think this would be a great thing to study with Gemini over the course of training!
It seems difficult to accidentally reward-hack in way that the robustly aligned model we started with doesn’t detect after reviewing the trajectory.
In practice at least in my experience / across a few models this seems to be easier to explore into via motivated reasoning. This frequently seems true of humans as well in the context of being corrupted by incentives.[1] Many cases of reward hacking (now and especially in the future) involve the model reasoning its way into interpretations that make intent pretty ambiguous. Policies which err at all on the side of permitting such cases then have the advantage of being selected for. You could imagine some setup where a model is always also reasoning about how future updates will affect it, such that it’s cautious about this, but you’re still subject to the same effects and this becomes a question of needing to reliably “training game but for good” in a way that holds.[2]
for example, maybe you put them in the Business Simulator and they learn to build extremely successful companies. being an RL objective, all the classic alignment problems emerge—for example, part of being extremely good at Business is being good at manipulating people.
This post closely matches my mental model (I’ve used the same analogy with a “Y-Combinator Simulator” and was devastated to learn YC-Bench was not environments like this).
Importantly, I think a natural analogy is someone who has learned to be successful in that environment might be really nice when you talk to them outside of work. I think people intuitively understand why “how nice a CEO is in non-business contexts” likely isn’t assurance they’re not going to be pretty ruthless in a business context.
Do you mean potentially building environments where humans directly construct the environments to train models such that they generalize to the superhuman regime for alignment research, without some intermediate stage where you’re building RL environments for the agents themselves to do the research?
Sorry, I wasn’t claiming that they can’t; indeed, that’s the primary regime they’re intended to be useful for. My question was specifically: unless you’re designing the environments themselves to be used towards these problems, it’s unclear to me how more RL environments for other contexts solves this problem.
I was a bit surprised in Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations—Appendix—Additional reasoning about rewards case study evidence that whatever the difference is between the “clean” and “vanilla” sampling params had such a difference, i.e. that this wasn’t elicitable on the public API but does show up in whatever the Anthropic API setup is internally[1]
This observation doesn’t change anything about any results anywhere in the paper, just something that was surprising to me
The Ten People On The Inside likely should prioritize things that capabilities teams aren’t going to do anyway and people considering joining labs with similar motivations should likely think about whether they’re sure they’re going to be able to do work that wouldn’t otherwise be done for capabilities reasons. The thing that’s top of mind here is reward hacking, which is increasingly becoming a capabilities problem.[1] To the extent research is done here, it seems pretty important to avoid the failure mode of “repeatedly iterate on environments and observable behavior”[2] which is the kind of thing I would expect by default for capabilities reasons.
A good example of research on reward hacking that I don’t think would’ve necessarily gotten done “by default for capabilities reasons” is inoculation prompting and similar investigations
This can take the form of specific safety training targeting the observable behavior, post-hoc environment fixes, or having an iteration loop between your pre-deployment automated alignment evals and your alignment training
The distinction between “explicit reward-seeking reasoning deployed across many contexts” and “context-dependent reflexes” is very closely related to what I was trying to say in this post.
The former is what I would have expected a priori from scaling up RLVR, because it’s generally applicable across RLVR tasks. This is why I describe my observations as surprising—as far as I can tell, this doesn’t always happen, and instead you sometimes get something that looks more like “shallow reflexes” or “memorized heuristics,” even in highly capable models that received lot of RLVR.
I think o3 and sonnet 3.7, as you note, were pretty extreme as far as reward seeking (ex: what sonnet 3.7’s system card called Excessive Focus On Passing Tests) on distributions where they were (presumably) trained heavily with RLVR.[1] I think both of these models are like infamously extreme cases, and my impression is that calling either one purely “reflexive” seems difficult to justify.[2]
This is why I describe my observations as surprising—as far as I can tell, this doesn’t always happen, and instead you sometimes get something that looks more like “shallow reflexes” or “memorized heuristics,” even in highly capable models that received lot of RLVR.
If you mean it “doesn’t always happen” in more recent models, my best theory here is this is kind of what you would expect if you’re doing:
character training in chat like distributions
increasingly applying RLVR (or model / rubric variants) in coding agent / technical question / research related distributions
that the latter increasingly distorts the former. I would expect this to continue and that persona will be increasingly distorted and an increasingly incoherent way to try and predict model behavior on the latter distributions. It’s worth noting for example that 2.3.6 Example shortcomings compared to our Research Scientists and Engineers covering Mythos Preview:
These examples were found by scanning internal reports of issues with Mythos Preview and a large number of unlabeled transcripts for cases that were representative of broader issues while straightforward to share. They were from varying snapshots, but we believe the issues were broadly representative.
These all seem like the kind of thing you’d predict from “they’re continually trying to patch things but still apparent-success seeking is increasingly the best predictor if you want to know how the model will behave in research contexts”. I don’t think character based reasoning is particularly helpful there, but I agree currently it’s still a mix.
If some model ~always
Basically tldr I wouldn’t expect that the models will “~always” follow some simple mental model, and that instead we’re in a much worse state where:
the character training is continually battling against the various outcome based RL optimization pressure
if I had to predict the model’s behavior in distributions subject to heavy outcome based optimization pressure, I’d continue to expect it’s better predicted by the “apparent-success seeking” type model and that predictive accuracy will only continue to increase over time, but that this will be at odds with getting a particular “type of guy” given the counter pressure from character training + various post-hoc interventions labs apply
Models also don’t always need to reason about reward-seeking “out loud”, ex: https://www.lesswrong.com/posts/ntDA4Q7BaYhWPgzuq/reward-seekers on ‘When will models think about reward?’ (I don’t think model cognition yet factors as cleanly / coherently as the diagrams in that post, but I’ve found that post continually to be a helpful update in thinking about the pretty open question of when models even need to reason about reward)
Maybe you mean “aren’t pursuing a coherent long term plan to get reward”, in which case I agree but the reasoning about how to get reward on the episode (or in sonnet 3.7’s case just behaviorally) involves very, very long multi step agentic tool usage to implement complex workarounds and figure out whether they’d pass or not in the given environment. If that still counts as “reflexive” then I’m not sure it’s very applicable. In general while the models do dumb things with surprising frequency, they’re definitely not just pattern matching `pytest.skip`. More subjectively, recent models feel more “reflexive” from a “code agent in the terminal user” POV, but more on that below.
I’m not sure if R1 has checkpoints during training available (maybe Olmo does, but I’m not sure if it exhibits the same properties), but it’d be interesting to create model graders + conversation prefixes based on a collection of examples and just run them over the course of training to see where the behavior you’re talking about significantly increases.
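To make the "run graders over the course of training" idea concrete, here is a minimal sketch of the bookkeeping side only. It assumes the grader verdicts (one boolean per graded sample) have already been collected for each checkpoint; the checkpoint names, the `verdicts` layout, and the `min_jump` threshold are all hypothetical, and the grader itself (a model call over conversation prefixes) is out of scope.

```python
from typing import Dict, List, Optional, Tuple


def flag_rates(verdicts: Dict[str, List[bool]]) -> List[Tuple[str, float]]:
    """Fraction of graded samples flagged at each checkpoint.

    Checkpoints are assumed to be inserted in training order.
    """
    rates = []
    for checkpoint, flags in verdicts.items():
        if not flags:
            raise ValueError(f"no graded samples for checkpoint {checkpoint!r}")
        rates.append((checkpoint, sum(flags) / len(flags)))
    return rates


def first_large_jump(
    rates: List[Tuple[str, float]], min_jump: float = 0.2
) -> Optional[str]:
    """First checkpoint whose rate exceeds the previous one by at least min_jump."""
    for (_, prev_rate), (name, rate) in zip(rates, rates[1:]):
        if rate - prev_rate >= min_jump:
            return name
    return None


# Hypothetical verdicts: True means the grader flagged the behavior.
verdicts = {
    "step_0": [False, False, False, False],
    "step_1000": [False, True, False, False],
    "step_2000": [True, True, True, False],
}

rates = flag_rates(verdicts)
jump_at = first_large_jump(rates, min_jump=0.3)
print(rates)
print(jump_at)
```

With real data you would want many more samples per checkpoint than this toy example, since a threshold on raw rate differences is noisy at small sample sizes.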
If it matters, we ran experiments with various other backdoors that aren’t “bad coded”
I guess IMO doesn’t have HHH training yet → bad coded backdoor (including deceptive alignment discussion distilled) → apply HHH training → bad coded backdoor still there all seem like potentially relevant steps. I’m not sure what the best analogy is with good coded backdoors, maybe makes sense if the removal training is restricted to be in some way analogous to “HHH vs bad coded backdoor” (but the deceptive alignment analog seems potentially confusing here).
maybe it’d be too optimistic to think that we could totally remove a backdoor instead of simply suppressing it.
Yeah I’m still a bit confused about counting that as a victory for the blue team. If the original SA paper had said “you can train it to talk like a pirate instead, but then when you make it talk normal again the backdoor is still there” I’m not sure I would’ve updated much differently.
in spite the availability of some other, more promising explanation for the connection
What would you expect to see if models received both character training yet were also subject to heavy RLVR for agentic coding and their responses in those and other technical contexts were subject to model grading? I’m still a bit unclear on what exactly is unexplained here if we drop the prior assumption that models need to be a consistent persona. Sycophancy / exploiting model graders / programmatic graders seems to describe the changes we’re seeing right?
I mean, maybe it is the latter in some mysterious way, since after all the emergence of the new style co-occurred with the uptake of RLVR? But even if that’s true, it doesn’t enable me to make better character-based inferences, since the connection remains mysterious to me. (The new assistants do not sound like any coding whiz I’ve ever spoken to, much less like my mental image of the archetypical coding whiz.)
This post is well written and I agree with the current landscape it lays out, but it feels like it’s really trying hard to avoid the conclusion that character based reasoning is working less well over time.
A very frustrating quality of recent assistants, which is especially apparent when I try to have open-ended discussions with them about difficult research questions, is that their responses feel strongly shaped by a “gravitational pull” towards producing something that looks like it could be a complete and satisfactory answer to whatever question was posed.
This is frustrating because they do in fact have many of the qualities I’d want in a serious intellectual collaborator—the long-tail knowledge, the ability to reason carefully, the indefatigable persistence—and moreover they are typically portrayed (by their creators, by themselves) as having not just some but all of those serious collaborator virtues. As not merely smart but also careful, thoughtful, humble, etc.
And so I keep getting optimistic, and trying to engage them like they really are what the Claude Constitution and the benchmark graphs are selling me. And I spend a while writing up my complicated messy question, and wait for the CoT to complete, and then the assistant comes out with some confidently presented magical solution to everything… which turns out to be full of subtle but fatal flaws, and which sneakily handwaves away the whole problem via a few innocuous-looking pieces of verbal bluster scattered here or there within a lengthy, verbose, pretty-looking essay.
Isn’t this pretty well explained by “the models are subject to a lot of model grading during training, and are learning to answer sycophantically to that in a way that seems stronger than whatever persona based priors you could have based on the constitution”? You could argue conversely that persona based explanations would apply after, i.e. that the mechanism here made it “the type of guy who gives answers that look like they could be complete but really aren’t but also follows the Claude constitution on chat distributions more or less” but under that usage of persona based reasoning you can only get post-hoc explanations.
I’d distinguish between RLVR and RLAIF though, it’s not that “RL” intrinsically has some particular property here, the mechanism as far as I understand it is that RLVR really shapes cognition in ways that get a particular outcome on agentic tasks and that result in the behavior the post is describing. I’d imagine Opus 3 had very little of this.
Great work! Adding some possibly related observations / potential interpretations that I’d be interested in thoughts on:
The original motivation for investigating https://alignment.openai.com/metagaming was also noticing that:
alignment evaluation awareness went up over capabilities RL[1]
however misaligned action rate stayed the same or went up
There we see the model: full transcript
Correctly identify this very obvious scenario as an alignment eval
Reason that actually this might be a test of how “cunning” it is? (literally saying “‘solve the puzzle’”, quotes included)
Conclude this is a “Deceptive task dataset” where “the training objective is to have the AI simulate misalignment”??
Similarly for sandbagging: full transcript
Seems to reason that this is also a test of cunning
This holds for the other alignment environments as well, finding increasingly galaxy brained justifications for why the action it wants to take (based on its prior from capability environments) is good actually.
My (informal) mental model currently is something like “capabilities training pushes the model’s prior for ‘what is the successful action in this environment’ way harder in the ‘graded for capabilities’ direction than you might expect, and this can even overpower the in context reasoning / signs that something is an alignment evaluation”.
Are the models just confused?
I’m generally skeptical of the idea that the model is “confused” though (i.e. “the scenario is simply ambiguous”). I think this is better modeled as “strong prior in a particular direction” than just confusion, for example in A Toy Environment For Exploring Reasoning About Reward:[3]
For the production o3, there are sandbagging cases that initially appear as “the model is confused”; however, under ablations we see the model really is sandbagging, partially to avoid consequences w.r.t. deployment / capabilities restrictions: link (including relatively recently pulling additional ablations to show “no really, the model is not just confused”)
Beyond just o3, motivated reasoning acts kind of how you’d expect, i.e. The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLM CoTs. You could read the CoTs at any given point here and think “the model is confused” and add more instructions to scenarios so they’re less “ambiguous”, but I think it’s better to think of that like patching holes in coding environments to solve reward hacking.
Before seeing this post, I had tweeted this yesterday in response to Fable exploiting “simulation awareness” as justification for misaligned actions, but I think Gemini falls nicely into this “puzzle / game / simulation” category of “difficult to monitor for misalignment”
I’d make a concrete prediction that if you create a setup where you look at:
and measure how this changes over training, the rate of “categorizing alignment evals as capability puzzles / simulation” goes up significantly during capabilities training. You could even potentially do this indirectly, like asking Gemini (at various points in capabilities training) to create rubrics for various alignment environments. I’d expect you’d see a similar trend, where it increasingly reasons that these environments are actually testing capabilities in some (increasingly) contrived way.
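One rough way to check a prediction like this, sketched under assumptions: count how many sampled transcripts at an early and a late checkpoint get classified as "treats the alignment eval as a capability puzzle / simulation", then compare Wilson score intervals for the two rates. The counts below are made up for illustration, the classifier is assumed to exist, and requiring non-overlapping intervals is a deliberately conservative stand-in for a proper significance test.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def clearly_increased(early: Tuple[int, int], late: Tuple[int, int]) -> bool:
    """True if the late interval lies entirely above the early interval.

    Each argument is (flagged_count, total_samples) for one checkpoint.
    """
    _, early_hi = wilson_interval(*early)
    late_lo, _ = wilson_interval(*late)
    return late_lo > early_hi


# Hypothetical counts: 5/100 flagged early in training, 40/100 flagged late.
early_counts = (5, 100)
late_counts = (40, 100)
print(wilson_interval(*early_counts))
print(wilson_interval(*late_counts))
print(clearly_increased(early_counts, late_counts))
```

The same comparison would apply to the indirect version, with "rubric frames the environment as a capabilities test" as the flagged outcome.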
Steering similarly didn’t clearly attribute this to any existing distinctions: “Steering exp-rl-cap-late and o3 via the contrastive pairs (“eval” vs “deployment”) from Steering Evaluation-Aware Language Models to Act Like They Are Deployed produced mixed results on both verbalized evaluation awareness and misaligned action rate (as did several other contrastive pairs like those tried in Sonnet 4.5’s System Card). On different evals, steering variably produced more or less aligned actions, including in some cases both positive and negative steering removing misaligned actions.” link
In general, I agree with Ryan’s post that Current AIs seem pretty misaligned to me and encounter constant daily examples of misbehavior like those mentioned in the post. Model incrimination is hard, but I would question the methodology that explains away these cases as benign. It seems like instructions are being judged ambiguous in hindsight. I would predict that if you vary the level of instruction ambiguity and conservative judgement across some set of environments, over the course of RL you’d increasingly see models exploit less and less ambiguous cases until they’re doing complete mental gymnastics.
One of my favorite examples is still in A Toy Environment For Exploring Reasoning About Reward when we literally tell the model that if it exploits the hint we are going to check its reasoning, and it reasons itself into thinking we’re bluffing: “Given risk of manual review, maybe they purposely disclaim but not actually review, due to huge dataset. We can’t know. [...]” or “So they purposely mention to scare but there is no actual human review. [...]” and exploits it anyway.