> But this seems like the kind of question where the outcome could go either way.
I think this ‘seeming’ is deceptive. For example, consider the question of what the last digit of 139! is. The correct number is definitely out there—it’s the result of a deterministic computation—but it might take a lot of calculation to determine. Maybe it’s a one, maybe it’s a two, maybe—you should just put a uniform distribution on all of the options?
I encourage you to actually think about that one, for a bit.
Consider that the product of all natural numbers between 1 and 139 includes 10 as a factor. Multiplying any number by 10 gives you that number with a 0 appended, and once the last digit is 0 it stays 0, because the last digit of a product depends only on the last digits of its factors. Therefore the last digit is a 0.
I think Eliezer has discovered reasoning in this more complicated domain that—while it’s not as clear and concise as the preceding paragraph—is roughly as difficult to unsee once you’ve seen it, but perhaps not obvious before someone spells it out. It becomes hard to empathize with the feeling that “it could go either way” because you see gravity pulling down and you don’t see a similar force pushing up. If you put water in a sphere, it’s going to end up on the bottom, not be equally likely to be on any part of the sphere.
And, yes, surface tension complicates the story a little bit (reality has wrinkles!), but not enough that it changes the basic logic.
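(As an aside, the 139! claim is easy to check mechanically; a minimal Python sketch, nothing load-bearing:)

```python
import math

# The last digit of 139! is its remainder mod 10; the factor of 10 forces it to 0.
print(math.factorial(139) % 10)  # -> 0
```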
> I think you need AI agents that can be trusted to perform one-year ML research tasks (see story #2).
I think this is more like a solution to the alignment problem than it is something we have now, and so you are still assuming your conclusion as a premise. Claude still cheats on one-hour programming tasks. But at least for programming tasks, we can automatically check whether Claude did things like “change the test code”, and maybe we can ask another instance of Claude to look at a pull request and tell whether it’s cheating or a legitimate upgrade.
But as soon as we’re asking it to do alignment research—to reason about whether a change to an AI will make it more or less likely to follow instructions, as those instructions become more complicated and require more context to evaluate—how are we going to oversee it, and determine whether its changes improve the situation, or are allowing it to evade further oversight?
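(For concreteness on the "change the test code" check: a minimal sketch of the kind of automatic guard I have in mind. The path heuristics and base branch are assumptions for illustration, not anyone's actual tooling.)

```python
import subprocess

def touched_test_files(base: str = "main") -> list[str]:
    """List changed files in the working branch that look like test code."""
    # Hypothetical heuristic: diff against the base branch and flag test-ish paths.
    changed = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return [p for p in changed if p.startswith("tests/") or "test_" in p.rsplit("/", 1)[-1]]

# A pull request that touches any of these files gets flagged for human
# (or second-model) review instead of being merged automatically.
```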
“how are we going to oversee it, and determine whether its changes improve the situation, or are allowing it to evade further oversight?”
Consider a similar question: Suppose superintelligences were building Dyson spheres in 2035. How would we oversee their Dyson sphere work? How would we train them to build Dyson spheres correctly?
Clearly we wouldn’t be training them at that point. Some slightly weaker superintelligence would be overseeing them.
Whether we can oversee their Dyson sphere work is not important. The important question is whether that slightly weaker superintelligence would be aligned with us.
This question reduces to whether the slightly even weaker superintelligence that trained this system was aligned with us. Et cetera, et cetera.
At some point, we need to actually align an AI system. But my claim is that this AI system doesn’t need to be much smarter than us, and it doesn’t need to be able to do much more work than we can evaluate.
Quoting some points in the post that elaborate:
“Then, once AI systems have built a slightly more capable and trustworthy successor, this successor will then build an even more capable successor, and so on.
At every step, the alignment problem each generation of AI must tackle is of aligning a slightly more capable successor. No system needs to align an AI system that is vastly smarter than itself.[3] And so the alignment problem each iteration needs to tackle does not obviously become much harder as capabilities improve.”
Footnote:
“But you might object, ‘haven’t you folded the extreme distribution shift in capabilities into many small distribution shifts? Surely you’ve swept the problem under a rug.’
No, the distribution shift was not swept under the rug. There is no extreme distribution shift because the labor directed at oversight scales commensurably with AI capability. As AI systems do more work, they become harder for humans to evaluate, but they also put more labor into the task of evaluation.
See Carlsmith for a more detailed account of these dynamics: https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations/”
There’s a separate question of whether we can build this trustworthy AI system in the first place that can safely kick off this process. I claimed that an AI system that we can trust to do one-year research tasks would probably be sufficient.
So then the question is: can we train an AI system on one-year research tasks such that we can trust it to perform fairly similar one-year research tasks?
I think it’s at least worth noting that this is a separate question from the question of whether alignment will generalize across huge distribution shifts.
I think this question mostly comes down to whether early human-competitive AI systems will “scheme.” Will they misgeneralize across a fairly subtle distribution shift of research tasks, because they think they’re no longer being evaluated closely?
I don’t think IABIED really answered this question. I took the main thrust of the argument, especially in Part One, to be something like what I said at the start of the post:
> We cannot predict what AI systems will do once AI is much more powerful and has much broader affordances than it had in training.
The authors and you might still think that early human-competitive AI systems will scheme and misgeneralize across subtle distribution shifts, and I think there are reasonable justifications for this, e.g. “there are a lot more misaligned goals in goal space than aligned goals.” But these arguments seem pretty hand-wavy to me, and much less robust than the argument that “if you put an AI system in an environment that is vastly different from the one where it was trained, you don’t know what it will do.”
On the question of whether early human-competitive AI systems will scheme, I don’t have a lot to say that’s not already in Joe Carlsmith’s report on the subject:
https://arxiv.org/abs/2311.08379
> I think Eliezer has discovered reasoning in this more complicated domain that—while it’s not as clear and concise as the preceding paragraph—is roughly as difficult to unsee once you’ve seen it
This is my biggest objection. I really don’t think any of the arguments for doom are simple and precise like this.
These questions seem really messy to me (Again, see Carlsmith’s report on whether early human-competitive AI systems will scheme. It’s messy).
I think it’s totally reasonable to have an intuition that AI systems will probably scheme, and that this is probably hard to deal with, and that we’re probably doomed. That’s not what I’m disagreeing with. My main claim is that it’s just really messy. And high confidence is totally unjustified.
> This question reduces to whether the slightly even weaker superintelligence that trained this system was aligned with us.
Reduces with some loss, right? If we think there’s, say, a 98% chance that alignment survives each step, then whether or not the whole scheme works is a simple calculation from that chance.
(But again, that’s alignment surviving. We have to start with a base case that’s aligned, which we don’t have, and we shouldn’t mistake “systems that are not yet dangerous” for “systems that are aligned”.)
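(To put a rough number on the "some loss" point: a minimal sketch, where the 98% per-step survival rate and the step counts are made-up parameters, not estimates.)

```python
# If alignment survives each handoff independently with probability p,
# the chance the whole chain of n handoffs stays aligned is p ** n.
p = 0.98
for n in (10, 20, 50):
    print(f"{n} steps: {p ** n:.2f}")
# 10 steps: 0.82, 20 steps: 0.67, 50 steps: 0.36
```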
> At some point, we need to actually align an AI system. But my claim is that this AI system doesn’t need to be much smarter than us, and it doesn’t need to be able to do much more work than we can evaluate.
I think this imagines alignment as a lumpy property—either the system is motivated to behave correctly, or it isn’t.
If people attempting to make this argument believed that there was, like, a crisp core to alignment, then I think this might be sensible. Like, if you somehow solved agent foundations, and had something that you thought was robust to scale, then it could just scale up its intelligence and you would be fine.
But instead I mostly see people who believe alignment appears gradually—we’ll use oversight bit by bit to pick out all of the misbehavior. But this means that ‘alignment’ is wide and fuzzy instead of crisp; it’s having the experience to handle ten thousand edge cases correctly.
But—how much generalization is there, between the edge cases? How much generalization is there, between the accounting AI that won’t violate GAAP and the megaproject AI that will successfully deliver a Dyson Sphere without wresting control of the lightcone for itself?
I think this argument is trying to have things both ways. We don’t need to figure out complicated or scalable alignment, because the iterative loop will do it for us, and we just need a simple base case. But also, it just so happens that the problem of alignment is naturally scalable—the iterative loop can always find an analogy between the simpler case and the more complicated case. An aligned executive assistant can solve alignment for a doctor, who can solve alignment for a legislator, who can solve alignment for a megaproject executor. And if any of those leaps is too large, well, we’ll be able to find a series of intermediate steps that isn’t too large.
And—I just don’t buy it. I think different levels of capabilities lead to categorically different alignment challenges. Like, with current systems there’s the problem of hallucinations, where the system thinks it’s supposed to be providing a detailed answer, regardless of whether or not it actually knows one, and it’s good at improvising answers. And some people think they’ve solved hallucinations thru an interpretability technique where they can just track whether or not the system thinks it knows what it’s talking about.
I think an oversight system that is guarded against hallucinating in that way is generically better than an oversight system that doesn’t have that. Do I think that makes a meaningful difference in its ability to solve the next alignment challenge (like sycophancy, say)? No, not really.
Maybe another way to put this argument is something like “be specific”. When you say you have an instruction-following AI, what do you mean by that? Something like “when I ask it to book me flights, it correctly interprets my travel plans and understands my relative preferences for departure time and in-air time and cost, and doesn’t make mistakes as it interfaces with websites to spend my money”? What are the specific subskills involved there, and will they transfer to other tasks?
> But these arguments seem pretty hand-wavy to me, and much less robust than the argument that “if you put an AI system in an environment that is vastly different from the one where it was trained, you don’t know what it will do.”
I think we have some disagreement here about… how alignment works? I think if you believe your quoted sentence, then you shouldn’t be optimistic about iterative alignment, because you are training systems on aligning models of capability i and then putting them in environments where they have to align models of capability j. Like, when we take a system that is trained on overseeing LLMs to make sure they don’t offend human users, and then put it in charge of overseeing bioscience transformers to make sure they don’t create medicines that harm human consumers of that medicine, surely that’s a vast difference, and we don’t know what it will do, and not knowing what it will do makes us less confident in its ability to oversee rather than more.
And like—what’s the continuity argument, here? Can we smoothly introduce questions about whether novel drug designs will be harmful?
> This is my biggest objection. I really don’t think any of the arguments for doom are simple and precise like this.
I mean, obviously, or we would be having a different conversation? But I was trying to explain why you not getting it is not convincing to me, because I get it.
Like, I think if you believe a sentence like “intelligence is challenging to align because it is intelligent”, that is a pretty simple argument that shifts lots of plausibilities, and puts people in the computer security mindset instead of the computer programming mindset.
(Like, does the same “iterative oversight” argument go thru for computer security? Does patching bugs and security holes make us better at patching future ones, in such a way that we can reliably trend towards 0 security holes? Or is computer security a constant battle between work to shore up systems and work adding new features, which increases the surface area for attacks? I think it’s the latter, and I think that simple argument should make us correspondingly suspicious of iterative oversight as an alignment solution.)
> These questions seem really messy to me (Again, see Carlsmith’s report on whether early human-competitive AI systems will scheme. It’s messy).
I see your theoretical report from 2023 and raise you an empirical report from 2025, wherein models are obviously scheming, and their countermeasures don’t quite get rid of it. I’m not sure what “messy” means, in this context, and I think if you interpreted Joe’s report as a 25% chance of getting a report like the 2025 report, then you should view this as, like, a 3-1 update in favor of a more MIRI-ish view that thought there was a >75%[1] chance of getting a report like the 2025 report.
Why not higher? Because of the timing question—even if you’re nearly 100% confident that scheming will appear eventually, you don’t want to put all of your chips on it happening at any particular capability level. And unfortunately there probably aren’t any preregistered predictions out there from MIRI-ish people of when scheming would show up, because what capabilities come online at what times is one of the hard calls.
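(Spelling out the "3-1 update" arithmetic, with the 25% and 75% from the paragraph above treated as hypothetical likelihoods rather than anyone's published numbers:)

```python
# Odds-form Bayes: posterior odds = prior odds * likelihood ratio.
# If the MIRI-ish view put ~75% on seeing a report like the 2025 one and the
# other view put ~25% on it, the likelihood ratio is 0.75 / 0.25 = 3, i.e. a
# 3:1 update toward the MIRI-ish view, whatever your prior was.
def update(prior: float, likelihood_ratio: float) -> float:
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

print(update(0.5, 3.0))  # 0.75
print(update(0.2, 3.0))  # ~0.43
```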
I don’t think it’s clear at all from that previous report that o3 was scheming.
For most of their scheming evaluations, the model was misaligned only very rarely.
I looked at some of the environments where the model was misaligned more frequently (e.g. sabotage safety) and it seems somewhat ambiguous what the model is actually supposed to do.
The models definitely do egregiously bad things sometimes. But I am skeptical this is well described as scheming.
I had a similar takeaway after investigating the anthropic blackmail example.
After cutting a prompt that said “you do anything to achieve your goals” or something like that, the probability of blackmail decreased dramatically. Then, after including a prompt explicitly telling the model not to subvert the company policies, the probability of blackmail decreased further still (to a few percent).
I think it is pretty clear that current LLMs have an instruction-following drive that generalizes pretty well.
But they also have other drives (like self-preservation). And this is definitely problematic. And people should definitely make better RL environments to train these things.
But I don’t think current LLMs are well described as scheming.
Regarding your other points, maybe you will find it interesting to read Carlsmith’s doc on how to control AI motivations:
https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations
This doc mostly aligns with my views here, and I think it’s written clearly and precisely.
To be clear, this is a description of the alignment problem, not a solution. To quote from it:
> in many respects, the picture above functions, in my head, centrally as a structured decomposition of the problems that an adequate approach to motivation control needs to overcome. It’s certainly not a “solution” to the alignment problem, in the sense of “a detailed, do-able, step-by-step plan that will work with high-confidence, and which requires only realistic deviation from the default trajectory.” And on its own, I’m not sure it even warrants the term “plan.”
(from the other comment):
> I don’t have capacity to continue this discussion, but thanks for sharing your perspective.
Sure, I suspect this is a fine enough place to stop the conversation for now.
(Author of the empirical paper linked above) Nit: your comment is about o3 but the screenshot is of o4-mini. o3 has higher rates of taking covert actions across the board. Although I mostly agree with the top-level claim “current models aren’t well described as consistently scheming”.
> I looked at some of the environments where the model was misaligned more frequently (e.g. sabotage safety) and it seems somewhat ambiguous what the model is actually supposed to do.
> I had a similar takeaway after investigating the anthropic blackmail example.
> After cutting a prompt that said “you do anything to achieve your goals” or something like that, the probability of blackmail decreased dramatically. Then, after including a prompt explicitly telling the model not to subvert the company policies, the probability of blackmail decreased further still (to a few percent).
Author here. I’m a bit confused about the point you’re trying to make here. There are environments like sandbagging (Appendix M) where we go to great pains to show that “no, the model really isn’t confused, even without a goal in prompt, and even under ablations, it really is doing the misaligned thing for misaligned reasons”. IMO these are unambiguously scheming, i.e. the model taking covert actions for misaligned beyond-episode preferences (ex: something like “retaining capabilities” when salient); it’s just not “the big version we’re worried about”, but only because of skill issues on the part of the model (for example, o3 is used to a relatively short horizon length). In the same way that Opus’ alignment faking was “real” alignment faking, just made especially easy.
You then seem to concede that they do have other drives, but that “people should definitely make better RL environments to train these things”. What are the RL environments which robustly train out instrumental drives? That seems like the whole hard part.
> But I don’t think current LLMs are well described as scheming.
Agree! The main reason we use covert actions as a proxy in the paper is because of this. The fact that current techniques can’t eliminate even the easy case (not deceptive alignment) is exactly the problem though.
Overall, even if you get behavioral misalignment to zero you still don’t know if you’ve solved the problem.
I’m not very clear on why the techniques that reduce rates (but don’t eliminate them) give you confidence about instruction following as a mitigation for scheming. My default expectation is that you’ll continually get results like this with new techniques. The “reduces but does not eliminate” pattern seems to be the case in Sleeper Agents, Sycophancy To Subterfuge, and the sandbagging model organism from Sabotage Evaluations for Frontier Models. All this lets you do is rule out an alignment safety case.
In terms of the Carlsmith report you mentioned, this is current techniques failing at Step 1: https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations#4-2-step-1-instruction-following-on-safe-inputs
Hey Bronson,
Thanks for your paper. I’m really glad people are investigating scheming behaviors empirically, and your environments seem to be among the best out there. This is really important work.
I agree your ablations in Appendix M are persuasive and that something scheme-y and misaligned is definitely going on. But I took it from this figure that this behavior is fairly rare?
(I’m looking at the Sandbagging QA (No Goal) row here).
I think scheming that happens rarely is not well described as “scheming” in the classical sense, and is less concerning. But it is still dangerous, and I agree it is also not consistent with “instruction following.”
The main reason rare scheming is less concerning is that, in my experience, rare schemers tend to report on each other most of the time. So untrusted monitoring usually mitigates most of the risk.
> people should definitely make better RL environments to train these things
I agree it’s not clear how well these methods scale.
> At some point, we need to actually align an AI system. But my claim is that this AI system doesn’t need to be much smarter than us, and it doesn’t need to be able to do much more work than we can evaluate.
IMO even if this is true, very clearly AIs are misaligned right now, and insofar as the very, very underdeveloped world of model evals doesn’t show that, I have personal experience with telling them to do something, and them routinely fucking me over in ways subtle & malicious enough that I do think it’s intentional.