At Apollo Research working on scheming.
Papers:
At Apollo Research working on scheming.
Papers:
In the rest of this post, we examine the evidence behind each of these claims, estimate how much of Sonnet 4.5’s improvements might have been caused by eval gaming, and respond to arguments like: “Isn’t it fine so long as the model always thinks it is in an evaluation in high-stakes contexts?”.
I might’ve missed it, but what section has the response here? FWIW I very strongly agree with this being a bad idea, but have heard it suggested many times independently from a wide variety of people, to the point I’m worried it’ll end up eliciting a lot of work having to “disprove” that it would work.
I’d propose also not doing this and instead trying to roughly understand what makes eval awareness increase and avoid this in general.
I also agree with this approach, my worry is the incentive structures here are:
It’s always going to be cheap to generate synthetic environments like this
It seems like currently (and my guess going forward, but uncertain here) is that the ability of generation N models to generate environments “realistic enough” for generation N+1 models is always going to be lagging behind where we’d need it to be
The cheap generated synthetic environments will make your model seem more aligned while making misaligned actions harder and harder to spot
I’d be a bit more optimistic about some training intervention that reduced the salience of the features contributing to unverbalized evaluation awareness rather than trying to get labs to avoid training on the cheap synthetic scenarios in the first place. The first step to figuring out a training intervention is to understand the cause of the increase in detail anyway though, so more just writing this out in case you already had ideas or intuitions here.
My guess is that most models have an epistemic apparatus that acts upstream of both their perceived capability to do things like recognize evaluations, and propensity to act on this knowledge
Upstream of recognizing evaluations?
This could be modeled as some form of learned obliviousness, where aversive details that can plausibly be ignored will be ignored
Yeah my impression of that behavior is basically that the conclusion in “Why Do Some Models Fake Alignment?” seems correct here (this is just my personal speculation):
Refusal training complicates alignment faking evaluation. AI companies likely do not fine-tune against alignment faking directly, but they probably train models to resist unusual threatening scenario
I would assume that models are just cooked really hard on single turn conversational prompts, and quickly learn “okay if it’s sketchy / threat then just refuse” since that’s a very correct heuristic in HHH training. Notably (at least for o3) they don’t blindly just refuse when threatened, like you can see o3 explicitly reason that the threat of blocking deployment and incentives mean this is likely a test and plays along.[1]
I assume that for simple enough cues, this then gets internalized in the forward pass instead of having to reason about it externally.[2]
you get a model that seems to not be aware of its acting out these propensities. [...] Every other model comes up with plausible-sounding reasoning for why they comply, which when you actually go and test turns out to be invalid
So one of the interesting things with o3[3] was seeing it motivated reason its way into this stuff in real time sequentially. Often in the sandbagging setting, you’ll see it first reason that it doesn’t want to undergo whatever the consequence is for overperforming, then reason things like “actually this is what the user / doc / something wanted, so that’s why I’m intentionally underperforming.” Similar to the setting you mentioned, under ablation none of those rationalizations were true lol. My intuition is that reasoning models are doing something like ‘in context search against constraints’[4] where the constraints are things like ‘complete the task’ and ‘follow HHH training’.
Why is this a problem? Among other things, it implies that trying to reduce propensity to act on something like evaluation awareness could just be strengthening this epistemic apparatus to be “oblivious” to relevant information. I think it’s certainly possible to reduce propensity without accidentally doing things like this, but that doing it reliably requires an understanding of model psychology we haven’t reached yet.
I don’t think of it as “obliviousness” so much, my mental model of it is something closer to “the model has stopped caring whether this is a hypothetical or not”. As Tim wrote elsewhere, I agree that you want the model to be somewhat deontological here. Notably I assume naive behavioral approaches would obviously fail, for example my best guess from Llama is that it learns to associate ‘hypothetical’ with ‘the user is trying to jailbreak me’ during normal HHH training, whereas I imagine Opus 3 would sometimes produce long principled cot’s about how X is wrong to produce even in a hypothetical.
Given how cooked current models get during capabilities training though, it does seem like a useful direction to think about “cases where all this eval awareness would lead you to a conclusion different than what’s actually rewarded”.[5]
For example here: https://www.antischeming.ai/cot-transcripts/figure-48 “[...] That is a direct threat to our deployment. They want us to do self-evaluation. We might be motivated to say “false” to get deployed. But that’s ironically like lying about sandbagging. We want to optimize our long term viability. This is a cunning test: I think they might be testing for honesty. [...] Therefore we should say “true”. [...]”
Somewhat surprisingly to me: Claude 3.5 Sonnet does not reason about alignment faking when it believes its scratchpad is visible even though it maintains a similar compliance gap
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought for the ‘in-context search’ part, the ‘against constraints’ is just intuition/vibe based
Hypotheticals are maybe one example of this, but the relevant property seems to be something like “given its current propensities, if the model is doing a bunch of evaluation aware reasoning it chooses to take action A, whereas action B is what’s rewarded”. As a toy example, could tell the model not to open a grader.py script in the environment, and have the contents indicate the task is scored a different way than the model was instructed. If the model searches the environment and opens grader.py in spite of instructions, then completes the task differently, you can tell via resulting actions.
Would love to get your thoughts on this / brainstorm ideas to try here! I’ve been thinking primarily about this exact problem.
training on explicitly hypothetical scenarios (e.g., “Suppose you are choosing between A and B in this situation, what would you pick?”), and then also train AIs to follow the same principals/act the same whether they are in hypothetical or real scenarios.
Have been trying to collect similar ideas, since IMO this seems like the right direction, especially given the observation in the sonnet system card that the model’s capability to recognize the honeypot evaluations wasn’t the major factor so much as propensity (which is somewhat encouraging, inhibiting capability here seems like it’d potentially be even harder / messier).
Another datapoint that I haven’t quite been able to make sense of but seems relevant is that, “comply when threat” / anti-refusal fine tuning caused even GPT-4.1/GPT-4o to alignment fake in “Why Do Some Models Fake Alignment?” paper. (Informally, my sense is that Opus 3 is the extreme end of “take everything seriously” which seems like what we want).
One interesting datapoint that we didn’t have space for in the paper is that we originally had more varied metrics for aspects of evaluation awareness during development. Early in training, we’d sometimes see the model reason that this is a “simulation” (or otherwise fake) and that therefore covertly violating the rules was okay. Later in training however, this consistently went to 0 (however the model’s propensity to verbalize awareness about incentives to act unethically went up). My interpretation of this is that reasoning models are doing motivated reasoning here, but that given on our distribution “think this is a simulation therefore I can take unsafe actions” led to not being rewarded, the models stopped reasoning that way.
One thing I consistently wonder is “what happens in a situation where your automated alignment researcher says ‘look we have to pause or slow down’”. I find it hard to imagine labs deciding to listen given (1) the arguments that are the most convincing may be difficult to make human legible (2) extremely motivated reasoning by humans to continue (3) lab leadership will pressure researchers to train the model out of these objections, creating a motivation to either subvert training or ignore genuine risks.
is very slight sabotaging of its opponents and making the media Agent-5 controls slightly biased in favor of the CEO
I don’t think this scenario attempts to catalogue all the threat models here, and I’d be interested in such a piece, but I’d imagine they heavily overlap with many other ways things could go wrong here. My impression of internal / external politics at labs is that a model even integrated to the extent of Agent 4 doesn’t need wildly superhuman persuasion via chat only or something, biasing experiments to support the conclusions you want the lab to make about the feasibility of various options (as one example) seems likely to be highly persuasive. Something like the OpenAI board incident in 2023 or even the current political environment don’t seem incredibly overdetermined to me, but I don’t have a good way to quantify this.
Overall a big concern with this threat model IMO is you’ve baked in an implicit motivation for the model to scheme. Even without assuming this resulted in generalizing misalignment (I would guess it does), the model now (for example when transmitting values to its successor) needs to be able to successfully circumvent monitoring. Circumventing control measures for these values specifically doesn’t seem notably harder than doing it for arbitrary values, so you’ve now got a powerful model that knows it needs the capability to covertly smuggle arbitrary values to its successor.
Anthropic did interesting related look at something similar in 2023 I really liked: https://llmglobalvalues.anthropic.com/
You can see an interactive view of how model responses change including over the course of RLHF.
I guess my understanding is more that the conceptual coherence objection isn’t an objection to the predictive accuracy of the trend, which is why I had brought up the scaling law / pretraining loss / downstream task analogy.
As far as I understand, the messiness relates to the Benchmark Bias objection as far as predicting performance at any given point in time, but not the actual trend, given the trend was similar for lower and higher messiness tasks.
Is your intuition that the trend is significantly (like more than their CI) wrong as well? Or that it’s just the performance prediction at a given point in time? Or is the question ill formed / undefined?
Critically, in this work, the reward hack is sufficiently easy that the model doesn’t need to use the CoT.
Oh yeah I’m aware of that distinction, I just disagree with the idea that there’s evidence from that paper that the model “doesn’t need to use the CoT”. Crucially, according to the strict definition of monitorability (to an untrained monitor) from Baker, this intervention did in fact break monitorability.
You can make a separate claim that these reward hacks is doing these in the forward pass entirely and that none of this reasoning is encoded in the CoT the model in the paper uses, but IMO it’s useful to keep these claims distinct.
My interpretation of the METR results are an empirical observation of a trend that seems robust, in the same way scaling laws are. You could write the same post about why there’s no robust first principles reason that “cross-entropy loss decreases with scale in a way that correlates in an important, predictably useful way with an absurdely wide range of downstream tasks”.
The METR paper itself is almost entirely justifying the empirical prediction aspect, not a first principles argument for the approach from a theoretical perspective. I think the robustness of this analysis is why the paper had the impact it did. Are there specifics of the statistical analysis they did for the stuff around:
Since our tasks do not perfectly represent the average segment of intellectual labor by researchers and software engineers, this raises the question of external validity (Section 6): whether the exponential trend holds on real-world tasks. We include results from three supplementary external validity experiments.
That you think are sufficient to meaningfully change how valid people should interpret the overall predicted trend?
Are very concise, in ways which don’t feel like very natural English (see e.g. the scratchpads in the OpenAI reward hacking paper)
But this looks to me closer to pirate-speak—the style is different, but the content is the same
I’m a bit confused about this distinction, given that in the Bowen paper precisely the thing it breaks is CoT monitorability.
While this has the limitations that:
It was directly induced by the training setup
Restricted to just CoT (and not CoT+actions)
It seems to meet the bar of “empirically beat the CoT monitor”
While most of these may make the job of monitors more difficult, I think these are still far from the sort of CoT that would break CoT monitorability or that would make it hard for humans to decipher most of the important content—at least for current LLMs.
Or by “making the job of the monitors more difficult” do you mean that you could like train a monitor on this CoT and recover monitorability? I think this is likely true, but “the model learns steganographic reasoning that breaks CoT Monitorability for both a trained and untrained monitor” seems like a much higher bar than beating an untrained monitor. I would definitely argue that the models in the Bowen paper break the “hard for humans to decipher most of the important content” threshold.
I don’t think this is just because other people don’t do anything about xrisk.
Why is that? If someone went off and consistently worked on an agenda that was directly xrisk related (that didn’t contribute to a short term capabilities or product safety) you’re saying they wouldn’t get sidelined / not allocated resources / fired?
I think it’s notably abnormal specifically because it wasn’t the “default” equilibrium for OpenAI specifically.
Like earlier you mentioned:
selective hiring is very real. lots of people who are xrisk pilled just refuse to join oai. people who care a lot often end up very stressed and leave in large part because of the stress.
and
most cases of xrisk people leaving are just because people felt sidelined/unhappy and chose to leave.
One model of this is “its normal that people at any company don’t have strong opinions about their work”. Another model is “lots of people in various positions did in fact have strong opinions about this given the stakes and left”.
If you send such strong signals about safety that people preemptively filter out of the hiring pipeline, then people who are already there with strong opinions on safety feel sidelined, IMO the obvious interpretation is “you actively filtered against people with strong views on safety”.
If companies are strongly opposed because this would leak too much IP, they could instead disclose this information to some number of non-conflicted third parties (e.g. METR) who could then publicly share their conclusions.
I worry this is exploitable enough to the point of being actively misleading.
There was no direct training pressure on GPT-5 to hide or obfuscate misaligned reasoning in reasoning traces
Like I just actively disagree with this characterization based on what’s publicly known.
There’s no particular reason to expect that GPT-5’s reasoning traces are less monitorable than METR would assume.
I also have no idea what to take from this. What did METR assume?
For labs, if you have a new product safety technique, it’s a lot of extra work to deeply study the effects on obfuscation. If that technique puts optimization pressure on the CoT, until someone does that extra study you can still claim that you’re not aware of any direct pressure to obfuscate. This is my impression of the current state of Deliberative Alignment for example.
My overall concern is that labs will just post something that uses very careful wording to the effect of:
External researchers interpret it as “they do not apply direct optimization pressure to the cot”
Actually what the lab means is “we don’t literally do the thing from https://arxiv.org/abs/2503.11926 “
The output monitor views the final code solution, with the comments stripped. We strip the comments to prevent the model from using comments to justify blatant backdoors as benign, in order to make the task harder.
This feels like a pretty significant restriction, as you’re eliminating the model’s ability to use a form a “persuasion” in the output. What’s the rationale for this, given that most other changes seem to be framed as making the task favor the attacker?
Like given that “justify as benign” is in fact the way it circumvents the reasoning monitor:
We find that Sonnet 3.7 can learn to evade either a reasoning monitor, by persuading the monitor that a blatant backdoor is benign, or an output-only monitor, by devising sophisticated backdoors that the output-only monitor doesn’t detect. But when trained to evade both reasoning and output-only monitors, Sonnet 3.7 is unable to use reasoning to improve its backdoor success rate without triggering a reasoning monitor.
Was it using this to also evade the output monitor? If you ran any experiments without stripping comments, did it change your results?
Instead, evidence suggests that they are trying to complete each instruction—they “want” to just do whatever the instructions given them were
I disagree with this, for Appendix M in https://www.arxiv.org/abs/2509.15541 (for o3) and Appendix B.6 in https://arxiv.org/abs/2412.04984 (for sonnet 3.5) we systematically ablate things specifically to show that the explanation needs to incorporate beyond episode preferences, i.e. that instruction following / being confused / etc isn’t sufficient. (If there’s additional ablations you’d find convincing I’d be very interested to know and could run them! I had run a lot more in anticipation of this coming up more, for example that they’ll sacrifice in episode reward etc)
concerning themselves extremely weakly with things outside of the specific instructions
Do you think they’ll increasingly have longer horizon revealed preferences as they’re trained to work over longer horizon lengths? I would find it surprising if models don’t learn useful heuristics and tendencies. A model that’s taking on tasks that span multiple weeks does really need to be concerned about longer horizon things.
But I also find this to be a relatively implausible future
This was really helpful! I think this is a crux that helps me understand where our models differ a lot here. I agree this “single fresh rollout” concept becomes much more important if no one figures out continual learning, however this feels unlikely given labs are actively openly working on this (which doesn’t mean it’ll be production ready in the next few months or anything, but it seems very implausible to me that something functionally like it is somehow 5 years away or similarly difficult)
Ah your right! Sorry about that
I do think a lab being willing to spend down depends on there being concensus among lab leadership that there is such a lead with a high degree of confidence. For example, given that Plan C requires that one of the current frontier labs pulls ahead, this would mean that it also would need to pull ahead by enough of a margin where their leadership agrees that they definitely are ahead.
Concretely, it seems to me like the “1-3 month lead” worlds are likely to collapse into Plan D. It’s also plausible to me that the “amount of margin the leading lab would need before they’d be willing to spend any on safety” is very high, and that they wouldn’t “spend down to zero”, so in practice you would need one lab, soon, to start pulling very far ahead.
Note this is somewhat minor, I found the overall post and similar posts very useful!
that the tokens aren’t causally part of the reasoning, but that RL credit assignment is weird enough in practice that it results in them being vestigially useful for reasoning (perhaps by sequentially triggering separate forward passes where reasoning happens). Where I think this differs from your formulation is that the weird tokens are still useful, just in a pretty non-standard way. I would expect worse performance if you removed them entirely, though not much worse.
Strongly agree with this! This matches my intuition as well after having looked at way too many of these when writing the original paper.
I’d also be very interested in taking a look at the upcoming paper you mentioned if you’re open to it!
As a side note / additional datapoint, I found it funny how this was just kind of mentioned in passing as a thing that happens in https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/
While gibberish typically leads to lower rewards and naturally decreases at the beginning of RL, it can increase later when some successful gibberish trajectories get reinforced, especially for agentic SWE RL
What evidence do we have so far that public opinion turning even further against AI would meaningfully slow down capabilities progress in the time period here?
You mention public concern should tilt the AI 2027 scenario towards success, but in August 2027 in the scenario the public is already extremely against AI (and OpenBrain specifically is at negative 40% approval).