> This question reduces to whether the slightly weaker superintelligence that trained this system was aligned with us.
Reduces with some loss, right? If we think there’s, say, a 98% chance that alignment survives each step, then the chance that the whole scheme works is a simple calculation from that per-step chance.
(But again, that’s alignment surviving. We have to start with a base case that’s aligned, which we don’t have, and we shouldn’t mistake “systems that are not yet dangerous” for “systems that are aligned”.)
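To make the compounding concrete, here’s a minimal sketch of that calculation, under the simplifying assumption that each hand-off survives independently with the same probability:

```python
# Toy sketch of the compounding-loss point: if alignment survives each
# hand-off independently with probability p, it survives n hand-offs with
# probability p**n. The 0.98 is the illustrative figure from above.
p = 0.98
for n in (1, 10, 35, 100):
    print(n, round(p ** n, 3))
# 1 -> 0.98, 10 -> ~0.817, 35 -> ~0.493, 100 -> ~0.133
```

Even a 98% per-step survival rate erodes quickly once the chain of hand-offs gets long.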
> At some point, we need to actually align an AI system. But my claim is that this AI system doesn’t need to be much smarter than us, and it doesn’t need to be able to do much more work than we can evaluate.
I think this imagines alignment as a lumpy property—either the system is motivated to behave correctly, or it isn’t.
If the people making this argument believed that there was, like, a crisp core to alignment, then I think this might be sensible. Like, if you somehow solved agent foundations, and had something that you thought was robust to scale, then it could just scale up its intelligence and you would be fine.
But instead I mostly see people who believe alignment appears gradually—we’ll use oversight bit by bit to pick out all of the misbehavior. But this means that ‘alignment’ is wide and fuzzy instead of crisp; it’s having the experience to handle ten thousand edge cases correctly.
But—how much generalization is there, between the edge cases? How much generalization is there, between the accounting AI that won’t violate GAAP and the megaproject AI that will successfully deliver a Dyson Sphere without wresting control of the lightcone for itself?
I think this argument is trying to have things both ways. We don’t need to figure out complicated or scalable alignment, because the iterative loop will do it for us, and we just need a simple base case. But also, it just so happens that the problem of alignment is naturally scalable—the iterative loop can always find an analogy between the simpler case and the more complicated case. An aligned executive assistant can solve alignment for a doctor, who can solve alignment for a legislator, who can solve alignment for a megaproject executor. And if any of those leaps is too large, well, we’ll be able to find a series of intermediate steps that isn’t too large.
And—I just don’t buy it. I think different levels of capabilities lead to categorically different alignment challenges. Like, with current systems there’s the problem of hallucinations, where the system thinks it’s supposed to be providing a detailed answer, regardless of whether or not it actually knows one, and it’s good at improvising answers. And some people think they’ve solved hallucinations through an interpretability technique where they can just track whether or not the system thinks it knows what it’s talking about.
I think an oversight system that is guarded against hallucinating in that way is generically better than an oversight system that doesn’t have that. Do I think that makes a meaningful difference in its ability to solve the next alignment challenge (like sycophancy, say)? No, not really.
Maybe another way to put this argument is something like “be specific”. When you say you have an instruction-following AI, what do you mean by that? Something like “when I ask it to book me flights, it correctly interprets my travel plans and understands my relative preferences for departure time and in-air time and cost, and doesn’t make mistakes as it interfaces with websites to spend my money”? What are the specific subskills involved there, and will they transfer to other tasks?
> But these arguments seem pretty hand-wavy to me, and much less robust than the argument that “if you put an AI system in an environment that is vastly different from the one where it was trained, you don’t know what it will do.”
I think we have some disagreement here about… how alignment works? I think if you believe your quoted sentence, then you shouldn’t be optimistic about iterative alignment, because you are training systems on aligning models of capability i and then putting them in environments where they have to align models of capability j. Like, when we take a system that is trained on overseeing LLMs to make sure they don’t offend human users, and then put it in charge of overseeing bioscience transformers to make sure they don’t create medicines that harm human consumers of those medicines, surely that’s a vast difference: we don’t know what it will do, and not knowing what it will do should make us less confident in its ability to oversee, not more.
And like—what’s the continuity argument, here? Can we smoothly introduce questions about whether novel drug designs will be harmful?
> This is my biggest objection. I really don’t think any of the arguments for doom are simple and precise like this.
I mean, obviously, or we would be having a different conversation? But I was trying to explain why you not getting it is not convincing to me, because I get it.
Like, I think if you believe a sentence like “intelligence is challenging to align because it is intelligent”, that is a pretty simple argument that shifts lots of plausibilities, and puts people in the computer security mindset instead of the computer programming mindset.
(Like, does the same “iterative oversight” argument go through for computer security? Does patching bugs and security holes make us better at patching future ones, in such a way that we can reliably trend toward zero security holes? Or is computer security a constant battle between work to shore up systems and work adding new features, which increases the surface area for attacks? I think it’s the latter, and I think that simple argument should make us correspondingly suspicious of iterative oversight as an alignment solution.)
> These questions seem really messy to me. (Again, see Carlsmith’s report on whether early human-competitive AI systems will scheme. It’s messy.)
I see your theoretical report from 2023 and raise you an empirical report from 2025, wherein models are obviously scheming and the countermeasures tried there don’t quite get rid of it. I’m not sure what “messy” means in this context, and I think if you read Joe’s report as putting, say, a 25% chance on a report like the 2025 one appearing, then you should view this as, like, a 3:1 update in favor of a more MIRI-ish view that put a >75%[1] chance on it.
Why not higher? Because of the timing question—even if you’re nearly 100% confident that scheming will appear eventually, you don’t want to put all of your chips on it happening at any particular capability level. And unfortunately there probably aren’t any preregistered predictions from MIRI-ish people about when scheming would show up, because which capabilities come online at what times is one of the hard calls.
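For what it’s worth, here is a rough sketch of that 3:1 arithmetic; the 0.75 and 0.25 come from the figures above, while the even prior odds are purely an illustrative assumption:

```python
# Bayes-factor sketch of the "3:1 update" claim. The 0.75 and 0.25 are the
# figures quoted above; the even prior odds are an illustrative assumption.
p_report_given_miri_ish = 0.75   # MIRI-ish view: >75% chance of a 2025-style report
p_report_given_skeptic = 0.25    # reading of the 2023 report: ~25% chance

likelihood_ratio = p_report_given_miri_ish / p_report_given_skeptic  # 3.0

prior_odds = 1.0                                # assumed 1:1, for illustration only
posterior_odds = prior_odds * likelihood_ratio  # 3.0, i.e. a 3:1 update
posterior_prob = posterior_odds / (1 + posterior_odds)
print(likelihood_ratio, posterior_odds, round(posterior_prob, 2))  # 3.0 3.0 0.75
```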
I don’t think it’s clear at all from that 2025 report that o3 was scheming.
For most of their scheming evaluations, the model was misaligned only very rarely.
I looked at some of the environments where the model was misaligned more frequently (e.g. sabotage safety) and it seems somewhat ambiguous what the model is actually supposed to do.
The models definitely do egregiously bad things sometimes. But I am skeptical this is well described as scheming.
I had a similar takeaway after investigating the Anthropic blackmail example.
After cutting the part of the prompt that said something like “you do anything to achieve your goals”, the probability of blackmail dramatically decreased. Then, after adding a prompt explicitly telling the model not to subvert the company’s policies, the probability of blackmail decreased further still (to a few percent).
I think it is pretty clear that current LLMs have an instruction-following drive that generalizes pretty well.
But they also have other drives (like self-preservation). And this is definitely problematic. And people should definitely make better RL environments to train these things.
But I don’t think current LLMs are well described as scheming.
Regarding your other points, maybe you will find it interesting to read Carlsmith’s doc on how to control AI motivations:
https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations
This doc mostly aligns with my views here, and I think it’s written clearly and precisely.
To be clear, this is a description of the alignment problem, not a solution. To quote from it:
> in many respects, the picture above functions, in my head, centrally as a structured decomposition of the problems that an adequate approach to motivation control needs to overcome. It’s certainly not a “solution” to the alignment problem, in the sense of “a detailed, do-able, step-by-step plan that will work with high-confidence, and which requires only realistic deviation from the default trajectory.” And on its own, I’m not sure it even warrants the term “plan.”
(from the other comment):
> I don’t have capacity to continue this discussion, but thanks for sharing your perspective.
Sure, I suspect this is a fine enough place to stop the conversation for now.
(Author of the empirical paper linked above) Nit: your comment is about o3 but the screenshot is of o4-mini. o3 has higher rates of taking covert actions across the board. Although I mostly agree with the top-level claim “current models aren’t well described as consistently scheming”.
> I looked at some of the environments where the model was misaligned more frequently (e.g. sabotage safety) and it seems somewhat ambiguous what the model is actually supposed to do.
> I had a similar takeaway after investigating the Anthropic blackmail example.
> After cutting the part of the prompt that said something like “you do anything to achieve your goals”, the probability of blackmail dramatically decreased. Then, after adding a prompt explicitly telling the model not to subvert the company’s policies, the probability of blackmail decreased further still (to a few percent).
Author here; I’m a bit confused about the point you’re trying to make. There are environments like sandbagging (Appendix M) where we go to great pains to show that “no, the model really isn’t confused; even without a goal in the prompt, and even under ablations, it really is doing the misaligned thing for misaligned reasons.” IMO these are unambiguously scheming, i.e. the model taking covert actions for misaligned, beyond-episode preferences (e.g. something like “retaining capabilities” when salient); it’s just not “the big version we’re worried about,” but only because of skill issues on the part of the model (for example, o3 is used to a relatively short horizon length). In the same way that Opus’s alignment faking was “real” alignment faking, just made especially easy.
You then seem to concede that they do have other drives, but that “people should definitely make better RL environments to train these things”. What are the RL environments which robustly train out instrumental drives? That seems like the whole hard part.
> But I don’t think current LLMs are well described as scheming.
Agree! The main reason we use covert actions as a proxy in the paper is because of this. The fact that current techniques can’t eliminate even the easy case (not deceptive alignment) is exactly the problem though.
Overall, even if you get behavioral misalignment to zero you still don’t know if you’ve solved the problem.
I’m not clear on why techniques that reduce rates (but don’t eliminate them) give you confidence in instruction following as a mitigation for scheming. My default expectation is that you’ll continually get results like this with new techniques. The “reduces but does not eliminate” pattern also shows up in Sleeper Agents, Sycophancy To Subterfuge, and the sandbagging model organism from Sabotage Evaluations for Frontier Models. All this lets you do is rule out an alignment safety case.

In terms of the Carlsmith report you mentioned, this is current techniques failing at Step 1: https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations#4-2-step-1-instruction-following-on-safe-inputs
Hey Bronson,

Thanks for your paper. I’m really glad people are investigating scheming behaviors empirically, and your environments seem like some of the best out there. This is really important work.
I agree your ablations in Appendix M are persuasive and that something scheme-y and misaligned is definitely going on. But I took it from this figure that this behavior is fairly rare?
(I’m looking at the Sandbagging QA (No Goal) row here).
I think scheming that happens rarely is not well described as “scheming” in the classical sense, and is less concerning. But it is still dangerous, and I agree it is also not consistent with “instruction following.”
The main reason rare scheming is less concerning is that, in my experience, rare schemers tend to report on each other most of the time. So untrusted monitoring usually mitigates most of the risk.
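As a back-of-the-envelope illustration of why that multiplies out (the 5% figures below are assumptions I’m inventing for the example, not numbers from the paper):

```python
# Toy model of untrusted monitoring with a rarely-scheming monitor.
# Both rates are illustrative assumptions, not measured values.
p_actor_schemes = 0.05    # assumed: the actor attempts a covert action on 5% of inputs
p_monitor_silent = 0.05   # assumed: an untrusted monitor copy fails to report it 5% of the time

# Assuming the two events are independent, an attempt slips through only when
# the actor schemes AND the monitor stays silent.
p_unreported = p_actor_schemes * p_monitor_silent
print(p_unreported)       # 0.0025 -- most of the per-input risk gets caught
```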
> people should definitely make better RL environments to train these things
I agree it’s not clear how well these methods scale.