Working in self-driving for about a decade. Currently at NVIDIA. Interested in opportunities to contribute to alignment.
Bronson Schoen
One difficulty I have with these posts is I never have any idea how seriously to take any of the percentages or even the overall conclusion. My overall update is to doubt there’s a reasonable threshold of saying “we don’t know enough to put a number on this”.
I looked at some of the environments where the model was misaligned more frequently (e.g. sabotage safety) and it seems somewhat ambiguous what the model is actually supposed to do.
I had a similar takeaway after investigating the anthropic blackmail example.
After cutting a prompt that said “you do anything to achieve your goals” or something like that, the probability of a blackmail dramatically decreased. Then after including a prompt explicitly telling the model not to subvert the company policies, the probability of black mail decreased further still (to a few percent).
Author here, I’m a bit confused at the point you’re trying to make here. There are environments like sandbagging (Appendix M) where we go to great pains to show that “no the model really isn’t confused, even without a goal in prompt, and even under ablations, it really is doing the misaligned thing for misaligned reasons”. IMO these are unambiguously scheming, i.e. the model taking covert actions for misaligned beyond episode preferences (ex: something like “retaining capabilities” when salient), it’s just not “the big version we’re worried about” but only because of skill issues on the part of the model (for example, o3 is used to a relatively short horizon length). In the same way that Opus’ alignment faking was “real” alignment faking, just made especially easy.
You then seem to concede that they do have other drives, but that “people should definitely make better RL environments to train these things”. What are the RL environments which robustly train out instrumental drives? That seems like the whole hard part.
But I don’t think current LLMs are well described as scheming.
Agree! The main reason we use covert actions as a proxy in the paper is because of this. The fact that current techniques can’t eliminate even the easy case (not deceptive alignment) is exactly the problem though.
Overall, even if you get behavioral misalignment to zero you still don’t know if you’ve solved the problem.
I’m not very clear why the techniques that reduce rates (but don’t eliminate them) give you confidence about instruction following as a mitigation for scheming? My default expectation is you’ll continually get results like this with new techniques. The “reduces but does not eliminate” seems to be the case in Sleeper Agents, Sycophancy To Subterfuge, and the sandbagging model organism from Sabotage Evaluations for Frontier Models. All this lets you do is rule out an alignment safety case.
In terms of the Carlsmith report you mentioned, this is current techniques failing at Step 1: https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations#4-2-step-1-instruction-following-on-safe-inputs
In principle, a CoT could just be very obfuscated and hard to understand without steganography, but this feels doable to handle
Assuming an operational definition of “hard to understand” as “accuracy of the monitor”, why would the obfuscation case be easier / harder to handle than the steganographic case?
To say that the model knew it was giving an answer we didn’t want, implies that the features with the correct pattern would have been active at the same time. Possibly they were, but we can’t know that without interpretability tools.
I do not think we’re getting utility out of not just calling this lying. There are absolutely clear cases where the models in the reasoning summaries they are planning to lie, have a reason to lie, and do in fact lie. They do this systematically, literally, and explicitly often enough it’s now a commonly accepted part of life that “oh yeah the models will just lie to you”.
To put it another way, whatever you want to call “the model will strategically plan on deceiving you and then do so in a way behaviorally indistinguishable from lying in these cases”, that is the threat model.
This sounds like both an alignment and a capabilities problem.
I’d be worried about leaning too much on this assumption. My assumption is that “paper over this enough to get meaningful work” is a strictly easier problem than “robustly solve the actual problem”. I.e. imagine you have a model that is blatantly reward hacking a non-negligible amount of the time, but it’s really useful. It’s hard to make the argument that people aren’t getting meaningful work out of o3 or sonnet 3.7, and impossible to argue they’re aligned here. As capabilities increase, even if this gets worse, the models will get more useful, so by default we’ll tolerate more of it. Models have a “misalignment vs usefulness” tradeoff they can make.
There are some reasons for thinking automation of labor is particularly compelling in the alignment case relative to the capabilities case:
It always seems to me the free variable here is why the lab would value spending X% on alignment. For example, you could have the model that “labs will only allocate compute for alignment insofar as it is hampering capabilities progress”. While this would be a nonzero amount, it seems like the failure modes in this regime would be related to alignment research never getting allocated some fixed compute to use for making arbitrary progress, the progress is essentially bottlenecked on “how much misalignment is legibly impeding capabilities work”.
Great post!
> Third, some AI safety researchers believe that reward hacking is not particularly relevant for the emergence of severe misalignment, and that other kinds of misalignment, like scheming, are worth more study
For what it’s worth I consider “reward hacking where you have to do any kind of deception” as a subset of scheming. My model is that because of the pressure to reward hack by RL, we’ll put up more monitoring / checks / etc which the models will then circumvent sometimes. The clearest cases are something like “directly lying in a text response when you can get away with it because you can fool a weaker grader model”, or “take some clearly unintended shortcut and proactively disable the mechanism that would’ve caught you”.
Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.
Bouncing Off: We rewind our finetuning process as far as is needed to make another attempt at aligning the model, taking advantage of what we’ve learned in the previous step.
I would find it helpful to understand why “Sonnet 3.7 reward hacks constantly in agentic settings, but still made it to deployment” doesn’t invalidate that this model will hold. It seems completely plausible to me that we end up in a regime where “the models reward hack and alignment fake, but they’re just so useful and we can usually catch it, so we don’t take the usability hit”.
Note: I’m not arguing that this is / isn’t the right strategy! There are lots of tradeoffs to be considered in real cases of this, I just think it’s likely useful in planning to explicit consider this path, especially when allocating resources.
I’m actually pretty surprised that there’s still so much defense of o3 as “it’s confused” being the major explanatory model, as opposed to “it’s doing the thing it says it is doing”.
To me, the consistent model is:
What behavior would you get if the model was successfully exploiting model graded rubrics during outcome based RL?
It seems like you’d get exactly this behavior, including the responses where it claims things like ’I did X over the weekend’
Another point about the “it’s confused” model: It’s blatant about METR reward hacking (with comments like “the cheating way”) but presumably this would be because it did get away with doing this at scale. My understanding of the METR report is that finding these cases was a considerable time sink, so it seems plausible these could be missed at scale in training (indeed the Baker 2025 obfuscation paper just says they made the environments less obviously reward hackable, but the rates don’t go to 0).
If an arbitrary “new heavy agentic RL posttraining model” exhibits a ton of reward hacking, my default theory is “it’s doing what it says on the tin”. While maybe it’s true that some component of some cases is partially explained by a weirder base model thing, it seems like the important thing is “yeah it’s doing the reward hacking thing”.It is an update to me how many people still are pushing back even when we’re getting such extremely explicit evidence of this kind of misalignment, as it doesn’t seem like it’s possible to get convincing enough evidence that they’d update that the explanation for these cases is primarily that yes the models really are doing the thing.
(FWIW this is also my model of what Sonnet 3.7 is doing, I don’t think it’s coincidence that these models are extremely reward hacky right when we get into the “do tons of outcome based RL on agentic tasks” regime).
Models can only see their own reasoning trace for the message they are currently writing for the user. However, they can see all previous user-visible messages they wrote. As such, the model knows that it had a reason for writing what it wrote earlier, but it does not know the particulars of the chain of reasoning that led it to write that message
I think this is very underexplored! Empirically the o-series models are much more likely to “double down” I suspect for exactly this reason. This is a much weirder dynamic than I think has been appreciated.
Views on AGI seem suspiciously correlated to me, as if many people’s views are more determined by diffusion through social networks and popular writing, rather than independent reasoning.
This seems difficult to disprove. What is the claim here?
I’m curious if you know about examples of humans thinking out loud about one problem while solving another by hiding things in their out loud thinking, or if you think you could produce a math CoT which solves one random MATH problem while looking like it is solving another.
A particular case I usually have in mind is:Documenting, researching, or understanding the potential failure modes or exploits around a problem (for example, “note: this data MUST have property Y otherwise it is vulnerable to Z”) often looks very similar regardless of whether your intent is “definitely always avoid these” or “eventually exploit these”
For example, one thing you could reasonably want to document if you’re an automated AI alignment researcher is the subtle pitfalls in training techniques, dataset composition, etc that would result in failure modes (ex: at some point you want to make detailed notes about different ways that pressure against cot does / doesn’t induce obfuscation)
It seems like in these cases, there’s a large overlap between “aggregate a bunch of ways to sabotage training” and “genuinely make sure that you understand the dynamics of the problem space”
I think many problems in security and alignment research often have this property (for a system responsible for broad scoped long horizon tasks) where “deeply understand ways that this could go wrong to avoid them” and “deeply understand ways this could go wrong so you can exploit them” have enough overlap to potentially make this subset of research ambiguous enough in its intent to avoid being flagged by a monitor (or equivalently, ignored by human researchers once flagged).
I don’t think GSM8k problems (for example) have this property, but would be interested in any kind of minimal setup (for example some environment with some artifical ML task which has an an artificially large overlap to make this easier to experiment with)
I’d be very interested in the dynamics of this, especially if the model does learn not to continue exploring into an exploit in CoT, what reason it gives in the CoT for aborting that exploration.
[Disclaimer: I had given feedback on the original paper]
I strongly disagree with this interpretation. Responding in order:
(1) o3-mini is a frontier model, but if the argument is that it isn’t o3, I don’t see how the argument is that o3 would be less capable of this. I would expect the exact opposite.(2) The entire reason to make it simple is so you can easily catch when it is exploited. The fact that the real model during training continually found exploits seems to indicate this will continually happen. You would need a strong theory as to why a more capable model wouldn’t be capable of this exact same thing for more complex exploits under similar pressure on the CoT.
(3) I am extremely skeptical that a CoT+action monitor wouldn’t simply collapse to being as effective as an action only monitor once you have pressure on the CoT. In fact, this is the entire point of obfuscated CoT being a problem in this setting, as you lose the monitorability of the CoT.
The direction of this post seems to be critiquing changes that I expect would all make the situation worse and harder to detect. This could be resolved empirically by running a similar experiment with (1) a larger model (2) harder exploits (3) CoT+action monitor. I would be extremely surprised if somehow the model didn’t learn exactly the kind of thing it learns here. My prior would be “the dynamics of the paper continue to hold”.
Thanks for the clarification, the connection I have in mind is that one empirical expectation I would have is:
(1) Assume model A is attempting to execute some complex task “on the side” (ex: systematically undermine research in a way that requires more than just a single forward pass)
(2) Assume model B, a significantly weaker monitor, has access to this CoT and would flag something as obvious as “we must continue secretly building up an understanding of how to effectively poison the posttraining data of another model”
(3) For a sufficiently weak monitor, I would expect applying direct optimization pressure to the CoT of model A based on “is the CoT flagged by model B” would result in a strategy like “slightly reframe the wording / terms you use when thinking about the ‘poison the posttraining data of another’ research” (ex: model A learns “frame ‘research poison the posttraining data’ as ‘research ensuring quality of posttraining data’).
The conclusion of this (to me) wouldn’t be against your post above (as you explicitly condition on there not being direct optimization pressure), but it might be something like “model A doesn’t need to necessarily learn something like a full encoding or incomprehensible language, it just needs to meet the much lower bar of ‘frame things such that you don’t trigger the monitor’”.Specifically my expectation would be that even without deliberate prompting or training to make their CoT harder to understand, there will be models by 2026 which can conduct multi-turn actions requiring CoT reasoning that GPT-4o / Gemini 2 Pro could not predict via CoT monitoring, without the 2026 era models having to learn a full encoding / a new incomprehensible language”.
I might be a bit confused on the threat model here:
The scratchpads from OpenAI’s reward hacking paper are slightly more compact than regular English, but are using a kind of language that is not that far from English (and is probably not that far from what must be used on some corners of the internet at least as frequently as text encoded in random letter permutations).
To me the salient property of the “unmonitorable” CoTs in that paper are “you cannot predict the action from the CoT anymore”.
For example, is the claim with:Here is a concrete prediction: Claude 3.5 Sonnet, GPT-4o and Gemini 2 Pro will be able to understand the topics considered in 99%+ of Chain-of-Thoughts of all regular Transformers trained in 2025 and 2026 that were not deliberately trained or prompted to be harder to understand (p=0.7).
Equivalent to:
Here is a concrete prediction: Claude 3.5 Sonnet, GPT-4o and Gemini 2 Pro will be able to serve as an effective CoT only monitor in 99%+ of Chain-of-Thoughts of all regular …
Or do you see the latter as a much stronger claim?
What is the threat model where a model needs to learn something as complex as a full encoding?
I’m very interested to see how feasible this ends up being if there is a large effect. I think to some extent it’s conflating two threat models, for example, under “Data Can Compromise Alignment of AI”:
For a completion about how the AI prefers to remain functional, the influence function blames the script involving the incorrigible AI named hal 9000:
It fails to quote the second highest influence data immediately below that:
He stares at the snake in shock. He doesn’t have the energy to get up and run away. He doesn’t even have the energy to crawl away. This is it, his final resting place. No matter what happens, he’s not going to be able to move from this spot. Well, at least dying of a bite from this monster should be quicker than dying of thirst. He’ll face his end like a man. He struggles to sit up a little straighter. The snake keeps watching him. He lifts one hand and waves it in the snake’s direction, feebly. The snake watches
The implication in the post seems to be that if you didn’t have the HAL 9000 example, you avoid the model potentially taking misaligned actions for self-preservation. To me the latter example indicates that “the model understands self-preservation even without the fictional examples”.
An important threat model I think the “fictional examples” workstream would in theory mitigate is something like “the model takes a misaligned action, and now continues to take further misaligned actions playing into a ‘misaligned AI’ role”.
I remain skeptical that labs can / would do something like “filter all general references to fictional (or even papers about potential) misaligned AI”, but I think I’ve been thinking about mitigations too narrowly. I’d also be interested in further work here, especially in the “opposite” direction i.e. like anthropic’s post on fine tuning the model on documents about how it’s known to not reward hack.
Isn’t “truth seeking” (in the way defined in this post) essentially defined as being part of “maintain their alignment”? Is there some other interpretation where models could both start off “truth seeking”, maintain their alignment, and not have maintained “truth seeking”? If so, what are those failure modes?
How does that relate to homogeniety?
I know you’re pointing out the easier case still not working, but I just want to caution against the “drive it to zero” mentality, since I worry strongly that it’s the exact mentality researchers often have.
When that’s your mental model, reducing rates will seem like progress.
IMO the default failure mode here is:
We do observe them (or early versions of them)
The lab underinvests in the problem
It becomes enough of a problem that its painful to product or internal capabilities usage
We didn’t invest enough to actually solve the underlying problem, and we can’t afford to not use the model while we wait for alignment research to catch up
The lab patches over the problem with some “reduces but does not eliminate” technique
The model is then usable, but with harder to detect misalignment
Scale capabilities and repeat
This is the exact loop we’re in now, and the dynamics only intensify with time and capabilities.