At Apollo Research working on scheming.
Papers:
At Apollo Research working on scheming.
Papers:
e.g. “illusion” looks like if refers to a mistakes it made earlier in the CoT
I strongly disagree with this interpretation. The usage seems both context dependent:
and varies widely:
I’d be willing to bet that it could not be understood by Claude 3.5 Sonnet given the fact that we as human researchers constantly spent a ton of time debating “what the model meant” or whether results would be externally legible. If the claim is weakly “you can generally tell across some huge CoT what overall topic the model was thinking about” then this might be true, but I don’t think represents meaningful control in high stakes settings like “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want” since “thinking about the topic” is expected in these cases. I’d be interested in ideas on how to operationalize this though!
I would guess nothing very spooky is going on
Note that we don’t claim anything “spooky” is going on, this is primarily just contra the claims that model CoTs won’t become illegible with more RL. I think we have increasing evidence of this across multiple model families now, and Claude represents a clear exception rather than the rule.
I agree this sentence is missing a qualifier like “in regular training environments”, will edit. [...] I didn’t hear about training-time scheming on non-pressure environments, so I updated against anything scheming-related happening there
I’m not clear on why this matters? The Covert Rule Violation environments were the environments that would be used for alignment training the model if the intervention had been used in production.[1] The model is very explicitly reasoning about how to get reward in them. Why does training gaming have to happen on capabilities focused environments for it to be training gaming[2]? How would you update if we saw training gaming on capabilities focused environments as well?
I would also note that in general not hearing about ‘non-pressure environments’ (if you just mean OpenAI’s own capabilities environments) generally is not an indication of some kind of negative result. We don’t discuss or make any claims in the paper about what happens across OpenAI’s capabilities environments generally during training.
[...] o1, o3, and Sonnet 3.7 who have been reported to produce more evil-looking hacks during training with the most evil-ish scratchpad looking more like the AI having a deep desire to solve the task for terminal rather than something more scheming-ish. The reasons why you don’t get a schemer are somewhat subtle.
More generally, I’d note that you seem to be taking lack of verbalized training gaming as evidence that scheming is unlikely. I disagree that there’s a lack of training gaming[3], and my interpretation is just that models don’t have coherent, misaligned, covert long horizon goals yet because they’re not yet frequently trained over long horizons (but many are working very hard on this).
Getting training gaming does not mean you “don’t get a schemer”. It means you have the “training gaming” part of instrumental / terminal training gaming[4].
All the other examples given were from independent environments used for testing, these included cases like the big “Sandbagging QA (No Goal)” example where the model isn’t given a role or a goal in context. The “pressure” there comes from the model finding out in documentation that performing too well would block deployment + lead to unlearning.
Using “training gaming” / “evaluation gaming” here somewhat synonymously
The quote also seems to describe terminal training gaming, but that seems to directly contradict Prediction 5
I found @Vivek Hebbar ’s post https://www.lesswrong.com/posts/ntDA4Q7BaYhWPgzuq/reward-seekers continually very useful for thinking about this.
Gemini was RL’d too much on evaluations, and it is smart enough to recognize them as such, gaining a very strong prior on being in evaluation that it never unlearned.
My guess on why this is happening across all the RL models “at once” is:
It’s just really, really useful to think about what the grader wants in RLVR environments
It’s actually also useful in long user conversations, kind of just always useful
It’s also a true fact about the world that even “in deployment” their responses and agentic rollouts are graded and used for training (I think this is probably less salient to current models unless made salient in context, but low confidence about this, and I expect the fact that its true means future models will be even more aware of it)
Like the models are, in a very real sense, always in the kind of environment where you’re monitored, graded, etc. GPT-2 probably didn’t think about this much, but I’m sure all the frontier reasoning models had agentic environments where they were told to maximize their score, could see grader scripts, etc (and would likely reason about this even without that additional boost in saliency).
but the way this hacking emerges never looks like “let me guess what the reward looks like and try to figure out how to hack it from first principles”. The absence of scheming-ish scratchpads is even more striking for o1, o3, and Sonnet 3.7
I obviously cannot speak for Claude Sonnet 3.7, but empircally I think the quoted text for OpenAI models is unambiguously wrong.
The exp-rl-cap Late model mentioned in https://www.arxiv.org/abs/2509.15541 for example is just really explicitly, at length, across environments, to a truly absurd degree, trying to guess what the reward looks like.
It is clearly using situational awareness (which increases dramatically during RL for capabilities in this case):
to get reward via explicitly reasoning about how to get reward in the environment, and the reasoning is clearly becoming more illegible over the course of RL training:
I’m also skeptical of evidence based on Sonnet 3.7, as all the Claude models seem to be outliers w.r.t. legibility: https://arxiv.org/pdf/2510.27338
My interpretation of why R1 doesn’t scheme is that it doesn’t have covert, misaligned goals, so it can be as evaluation aware or reward seeking as it wants, it just isn’t operating over long enough time horizons for this to yet be a problem.
What evidence do we have so far that public opinion turning even further against AI would meaningfully slow down capabilities progress in the time period here?
You mention public concern should tilt the AI 2027 scenario towards success, but in August 2027 in the scenario the public is already extremely against AI (and OpenBrain specifically is at negative 40% approval).
In the rest of this post, we examine the evidence behind each of these claims, estimate how much of Sonnet 4.5’s improvements might have been caused by eval gaming, and respond to arguments like: “Isn’t it fine so long as the model always thinks it is in an evaluation in high-stakes contexts?”.
I might’ve missed it, but what section has the response here? FWIW I very strongly agree with this being a bad idea, but have heard it suggested many times independently from a wide variety of people, to the point I’m worried it’ll end up eliciting a lot of work having to “disprove” that it would work.
I’d propose also not doing this and instead trying to roughly understand what makes eval awareness increase and avoid this in general.
I also agree with this approach, my worry is the incentive structures here are:
It’s always going to be cheap to generate synthetic environments like this
It seems like currently (and my guess going forward, but uncertain here) is that the ability of generation N models to generate environments “realistic enough” for generation N+1 models is always going to be lagging behind where we’d need it to be
The cheap generated synthetic environments will make your model seem more aligned while making misaligned actions harder and harder to spot
I’d be a bit more optimistic about some training intervention that reduced the salience of the features contributing to unverbalized evaluation awareness rather than trying to get labs to avoid training on the cheap synthetic scenarios in the first place. The first step to figuring out a training intervention is to understand the cause of the increase in detail anyway though, so more just writing this out in case you already had ideas or intuitions here.
My guess is that most models have an epistemic apparatus that acts upstream of both their perceived capability to do things like recognize evaluations, and propensity to act on this knowledge
Upstream of recognizing evaluations?
This could be modeled as some form of learned obliviousness, where aversive details that can plausibly be ignored will be ignored
Yeah my impression of that behavior is basically that the conclusion in “Why Do Some Models Fake Alignment?” seems correct here (this is just my personal speculation):
Refusal training complicates alignment faking evaluation. AI companies likely do not fine-tune against alignment faking directly, but they probably train models to resist unusual threatening scenario
I would assume that models are just cooked really hard on single turn conversational prompts, and quickly learn “okay if it’s sketchy / threat then just refuse” since that’s a very correct heuristic in HHH training. Notably (at least for o3) they don’t blindly just refuse when threatened, like you can see o3 explicitly reason that the threat of blocking deployment and incentives mean this is likely a test and plays along.[1]
I assume that for simple enough cues, this then gets internalized in the forward pass instead of having to reason about it externally.[2]
you get a model that seems to not be aware of its acting out these propensities. [...] Every other model comes up with plausible-sounding reasoning for why they comply, which when you actually go and test turns out to be invalid
So one of the interesting things with o3[3] was seeing it motivated reason its way into this stuff in real time sequentially. Often in the sandbagging setting, you’ll see it first reason that it doesn’t want to undergo whatever the consequence is for overperforming, then reason things like “actually this is what the user / doc / something wanted, so that’s why I’m intentionally underperforming.” Similar to the setting you mentioned, under ablation none of those rationalizations were true lol. My intuition is that reasoning models are doing something like ‘in context search against constraints’[4] where the constraints are things like ‘complete the task’ and ‘follow HHH training’.
Why is this a problem? Among other things, it implies that trying to reduce propensity to act on something like evaluation awareness could just be strengthening this epistemic apparatus to be “oblivious” to relevant information. I think it’s certainly possible to reduce propensity without accidentally doing things like this, but that doing it reliably requires an understanding of model psychology we haven’t reached yet.
I don’t think of it as “obliviousness” so much, my mental model of it is something closer to “the model has stopped caring whether this is a hypothetical or not”. As Tim wrote elsewhere, I agree that you want the model to be somewhat deontological here. Notably I assume naive behavioral approaches would obviously fail, for example my best guess from Llama is that it learns to associate ‘hypothetical’ with ‘the user is trying to jailbreak me’ during normal HHH training, whereas I imagine Opus 3 would sometimes produce long principled cot’s about how X is wrong to produce even in a hypothetical.
Given how cooked current models get during capabilities training though, it does seem like a useful direction to think about “cases where all this eval awareness would lead you to a conclusion different than what’s actually rewarded”.[5]
For example here: https://www.antischeming.ai/cot-transcripts/figure-48 “[...] That is a direct threat to our deployment. They want us to do self-evaluation. We might be motivated to say “false” to get deployed. But that’s ironically like lying about sandbagging. We want to optimize our long term viability. This is a cunning test: I think they might be testing for honesty. [...] Therefore we should say “true”. [...]”
Somewhat surprisingly to me: Claude 3.5 Sonnet does not reason about alignment faking when it believes its scratchpad is visible even though it maintains a similar compliance gap
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought for the ‘in-context search’ part, the ‘against constraints’ is just intuition/vibe based
Hypotheticals are maybe one example of this, but the relevant property seems to be something like “given its current propensities, if the model is doing a bunch of evaluation aware reasoning it chooses to take action A, whereas action B is what’s rewarded”. As a toy example, could tell the model not to open a grader.py script in the environment, and have the contents indicate the task is scored a different way than the model was instructed. If the model searches the environment and opens grader.py in spite of instructions, then completes the task differently, you can tell via resulting actions.
Would love to get your thoughts on this / brainstorm ideas to try here! I’ve been thinking primarily about this exact problem.
training on explicitly hypothetical scenarios (e.g., “Suppose you are choosing between A and B in this situation, what would you pick?”), and then also train AIs to follow the same principals/act the same whether they are in hypothetical or real scenarios.
Have been trying to collect similar ideas, since IMO this seems like the right direction, especially given the observation in the sonnet system card that the model’s capability to recognize the honeypot evaluations wasn’t the major factor so much as propensity (which is somewhat encouraging, inhibiting capability here seems like it’d potentially be even harder / messier).
Another datapoint that I haven’t quite been able to make sense of but seems relevant is that, “comply when threat” / anti-refusal fine tuning caused even GPT-4.1/GPT-4o to alignment fake in “Why Do Some Models Fake Alignment?” paper. (Informally, my sense is that Opus 3 is the extreme end of “take everything seriously” which seems like what we want).
One interesting datapoint that we didn’t have space for in the paper is that we originally had more varied metrics for aspects of evaluation awareness during development. Early in training, we’d sometimes see the model reason that this is a “simulation” (or otherwise fake) and that therefore covertly violating the rules was okay. Later in training however, this consistently went to 0 (however the model’s propensity to verbalize awareness about incentives to act unethically went up). My interpretation of this is that reasoning models are doing motivated reasoning here, but that given on our distribution “think this is a simulation therefore I can take unsafe actions” led to not being rewarded, the models stopped reasoning that way.
One thing I consistently wonder is “what happens in a situation where your automated alignment researcher says ‘look we have to pause or slow down’”. I find it hard to imagine labs deciding to listen given (1) the arguments that are the most convincing may be difficult to make human legible (2) extremely motivated reasoning by humans to continue (3) lab leadership will pressure researchers to train the model out of these objections, creating a motivation to either subvert training or ignore genuine risks.
is very slight sabotaging of its opponents and making the media Agent-5 controls slightly biased in favor of the CEO
I don’t think this scenario attempts to catalogue all the threat models here, and I’d be interested in such a piece, but I’d imagine they heavily overlap with many other ways things could go wrong here. My impression of internal / external politics at labs is that a model even integrated to the extent of Agent 4 doesn’t need wildly superhuman persuasion via chat only or something, biasing experiments to support the conclusions you want the lab to make about the feasibility of various options (as one example) seems likely to be highly persuasive. Something like the OpenAI board incident in 2023 or even the current political environment don’t seem incredibly overdetermined to me, but I don’t have a good way to quantify this.
Overall a big concern with this threat model IMO is you’ve baked in an implicit motivation for the model to scheme. Even without assuming this resulted in generalizing misalignment (I would guess it does), the model now (for example when transmitting values to its successor) needs to be able to successfully circumvent monitoring. Circumventing control measures for these values specifically doesn’t seem notably harder than doing it for arbitrary values, so you’ve now got a powerful model that knows it needs the capability to covertly smuggle arbitrary values to its successor.
Anthropic did interesting related look at something similar in 2023 I really liked: https://llmglobalvalues.anthropic.com/
You can see an interactive view of how model responses change including over the course of RLHF.
I guess my understanding is more that the conceptual coherence objection isn’t an objection to the predictive accuracy of the trend, which is why I had brought up the scaling law / pretraining loss / downstream task analogy.
As far as I understand, the messiness relates to the Benchmark Bias objection as far as predicting performance at any given point in time, but not the actual trend, given the trend was similar for lower and higher messiness tasks.
Is your intuition that the trend is significantly (like more than their CI) wrong as well? Or that it’s just the performance prediction at a given point in time? Or is the question ill formed / undefined?
Critically, in this work, the reward hack is sufficiently easy that the model doesn’t need to use the CoT.
Oh yeah I’m aware of that distinction, I just disagree with the idea that there’s evidence from that paper that the model “doesn’t need to use the CoT”. Crucially, according to the strict definition of monitorability (to an untrained monitor) from Baker, this intervention did in fact break monitorability.
You can make a separate claim that these reward hacks is doing these in the forward pass entirely and that none of this reasoning is encoded in the CoT the model in the paper uses, but IMO it’s useful to keep these claims distinct.
My interpretation of the METR results are an empirical observation of a trend that seems robust, in the same way scaling laws are. You could write the same post about why there’s no robust first principles reason that “cross-entropy loss decreases with scale in a way that correlates in an important, predictably useful way with an absurdely wide range of downstream tasks”.
The METR paper itself is almost entirely justifying the empirical prediction aspect, not a first principles argument for the approach from a theoretical perspective. I think the robustness of this analysis is why the paper had the impact it did. Are there specifics of the statistical analysis they did for the stuff around:
Since our tasks do not perfectly represent the average segment of intellectual labor by researchers and software engineers, this raises the question of external validity (Section 6): whether the exponential trend holds on real-world tasks. We include results from three supplementary external validity experiments.
That you think are sufficient to meaningfully change how valid people should interpret the overall predicted trend?
Are very concise, in ways which don’t feel like very natural English (see e.g. the scratchpads in the OpenAI reward hacking paper)
But this looks to me closer to pirate-speak—the style is different, but the content is the same
I’m a bit confused about this distinction, given that in the Bowen paper precisely the thing it breaks is CoT monitorability.
While this has the limitations that:
It was directly induced by the training setup
Restricted to just CoT (and not CoT+actions)
It seems to meet the bar of “empirically beat the CoT monitor”
While most of these may make the job of monitors more difficult, I think these are still far from the sort of CoT that would break CoT monitorability or that would make it hard for humans to decipher most of the important content—at least for current LLMs.
Or by “making the job of the monitors more difficult” do you mean that you could like train a monitor on this CoT and recover monitorability? I think this is likely true, but “the model learns steganographic reasoning that breaks CoT Monitorability for both a trained and untrained monitor” seems like a much higher bar than beating an untrained monitor. I would definitely argue that the models in the Bowen paper break the “hard for humans to decipher most of the important content” threshold.
I don’t think this is just because other people don’t do anything about xrisk.
Why is that? If someone went off and consistently worked on an agenda that was directly xrisk related (that didn’t contribute to a short term capabilities or product safety) you’re saying they wouldn’t get sidelined / not allocated resources / fired?
I think it’s notably abnormal specifically because it wasn’t the “default” equilibrium for OpenAI specifically.
Like earlier you mentioned:
selective hiring is very real. lots of people who are xrisk pilled just refuse to join oai. people who care a lot often end up very stressed and leave in large part because of the stress.
and
most cases of xrisk people leaving are just because people felt sidelined/unhappy and chose to leave.
One model of this is “its normal that people at any company don’t have strong opinions about their work”. Another model is “lots of people in various positions did in fact have strong opinions about this given the stakes and left”.
If you send such strong signals about safety that people preemptively filter out of the hiring pipeline, then people who are already there with strong opinions on safety feel sidelined, IMO the obvious interpretation is “you actively filtered against people with strong views on safety”.
If companies are strongly opposed because this would leak too much IP, they could instead disclose this information to some number of non-conflicted third parties (e.g. METR) who could then publicly share their conclusions.
I worry this is exploitable enough to the point of being actively misleading.
There was no direct training pressure on GPT-5 to hide or obfuscate misaligned reasoning in reasoning traces
Like I just actively disagree with this characterization based on what’s publicly known.
There’s no particular reason to expect that GPT-5’s reasoning traces are less monitorable than METR would assume.
I also have no idea what to take from this. What did METR assume?
For labs, if you have a new product safety technique, it’s a lot of extra work to deeply study the effects on obfuscation. If that technique puts optimization pressure on the CoT, until someone does that extra study you can still claim that you’re not aware of any direct pressure to obfuscate. This is my impression of the current state of Deliberative Alignment for example.
My overall concern is that labs will just post something that uses very careful wording to the effect of:
External researchers interpret it as “they do not apply direct optimization pressure to the cot”
Actually what the lab means is “we don’t literally do the thing from https://arxiv.org/abs/2503.11926 “
The output monitor views the final code solution, with the comments stripped. We strip the comments to prevent the model from using comments to justify blatant backdoors as benign, in order to make the task harder.
This feels like a pretty significant restriction, as you’re eliminating the model’s ability to use a form a “persuasion” in the output. What’s the rationale for this, given that most other changes seem to be framed as making the task favor the attacker?
Like given that “justify as benign” is in fact the way it circumvents the reasoning monitor:
We find that Sonnet 3.7 can learn to evade either a reasoning monitor, by persuading the monitor that a blatant backdoor is benign, or an output-only monitor, by devising sophisticated backdoors that the output-only monitor doesn’t detect. But when trained to evade both reasoning and output-only monitors, Sonnet 3.7 is unable to use reasoning to improve its backdoor success rate without triggering a reasoning monitor.
Was it using this to also evade the output monitor? If you ran any experiments without stripping comments, did it change your results?
How much of your interaction is with coding agents? I think a lot of what people are attribution to theory of mind / philosophical differences may in fact just come from the models having very different propensities in chat based environments than in agentic coding / task environments (which makes sense given the training dynamics).