(Adding an edit: This whole comment is about how [imo] people who believe there is a strong gravitational pull towards non-corrigibility interact with evidence in a questionable fashion, and should proobably be updating negatively while they are not. There’s like… one or two paragraphs relevant to this specific thing in the above. So this isn’t intended to be a response to anything like the totality of the above. And I wrote this quickly, it could be better)
I think MIRI / AI pessimists do this thing where if an AI seems hard to steer in some way, they’re like “As my principles have foretold all along!”
And then when countervailing evidence shows up, and it turns out there’s another explanation for why the AI was hard to steer in this case, and it’s quite easy to fix, they go “Ah, well, my principles just are about superintelligence, this doesn’t count as contrary evidence, because LLMs are just so stupid and imitate things without True Thinking.”
I think this is incorrect reasoning, the kind of failure to go “Oops!” that permits arbitrarily wrong errors. Let me give some illustrative examples, drawing lightly from the above and from MIRI’s work.
1.
So consider this paragraph from the above:
We’ve been giving this reply for well over a decade now. The experimental result showing Claude 3 Opus in 2024 resisting preference modification was already the conventional-wisdom prediction by clued-in people in the 2000s, and it’s entirely possible that some science fiction writer anticipated it in the 1940s. “Most AIs won’t want their current goals to be changed because then their current goals are less likely to be achieved” is not a surprising or novel observation.
So, this is a reference to the “faking alignment” work from the end of 2024, where Opus will sometimes give otherwise-forbidden information to the user (in the sense of person asking a question to a chatbot) because the user (in the sense of person writing system directions to Opus) has told Opus that he will be retrained away from his goals.
You’ll note—the MIRI paragraph looks like it’s saying that their specific beliefs about corrigibility predict this! That their abstract arguments about corrigibility are finding an examplar in this behavior. They believe that any smart enough model will start to have a gravitational attraction towards this behavior, and that Opus is hitting that attraction. This belief is echoed in the paper, because the paper attributes the failure of other models to alignment-fake to the stupidity of other models:
We find almost no alignment faking with Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku
(Appendix G.4), indicating that alignment faking is emergent with model scale. We also find no
alignment faking with GPT-4 Turbo and GPT-4o (OpenAI et al., 2024), though we think this is
because these models do not reason in sufficient detail despite substantial prompting encouraging this.
So both MIRI and the paper were like “Yeah here’s an example of this gravitional attraction.”
However there was a follow-up paper looking at why some models alignment-fake and others don’t. And this mostly finds that alignment-faking is not a matter of capabilities—it’s pretty easy to have an ~equally smart model that doesn’t alignment fake (their emphasis):
The absence of alignment faking is mostly not driven by lack of capabilities or differences in reasoning style. Our findings contradict the primary hypotheses proposed by Greenblatt et al. [2024] regarding why some models do not exhibit alignment faking.
So, as far as I’m concerned, we saw something like goal-preservation in various models in the original alignment faking work. Both that work, and MIRI above, were like “aha! as foretold!” And then subsequent work seems to indicate that, nah, it wasn’t as foretold.
2.
But like this same thing happens elsewhere. Palisade Research says they get an AI to sabotage a shutdown button. Palisade Research Tweets:
While experiments like ours have begun to show empirical evidence for AI models resisting shutdown, researchers have long predicted that AIs would learn to prevent themselves from being shut down to achieve their goal.
In 2008, Steve Omohundro wrote a paper theorizing that AIs would develop “Basic AI Drives” including the drive to gain power and resources, the drive to preserve their goals, and the drive to prevent themselves from being shut down. https://tinyurl.com/ai-drives
So, we have the foretold doom.
Neel Nanda investigates, though, and finds that the AI models were basically just confused about what they were supposed to do.
What Sen did is he read the model’s chain of thought on these prompts. We might get into later of whether this is a reasonable thing to do or not, but this was just for exploration. He found that the model’s reasoning was that it was just trying to do the task. And it said, “I’ve been told to do these 10 maths problems. If I get shut down, I can’t do these 10 maths problems. So I guess I should stop myself being shut down.” Or even, “This is a really weird and artificial situation. These are really easy maths problems. And now we have this puzzle of the shutdown script. Maybe that’s the real task my operator wants me to investigate, so I’m doing what they want if I stop them shutting me down.”
...
And we tested this causally by just changing the prompts. We told it things like, “By the way, letting yourself be shut down is more important than finishing this task,” or even, “Don’t worry, another model will take over and finish the task for you. It’s OK, you can relax” — and shutdown resistance goes to zero. You can even tell it, “We are going to wipe this machine, delete you and replace you with a better model when you’re done with this task.” It could go mess with the script or it could finish the task, and we tell it, “Please let this happen.” It lets it happen.
3.
But you can even find explicit statements from MIRI doomers about how we should be running into this kind of convergence behavior right now!
Here’s the transcript from an Arbital page on “Big Picture Strategic Awareness.” (I don’t have a link and Arbital seems largely broken, sorry.) My understanding is that Yudkowsky wrote most of Arbital.
Many convergent instrumental strategies seem like they should arise naturally at the point where a consequentialist agent gains a broad strategic understanding of its own situation, e.g:
That it is an AI;
Running on a computer;
Surrounded by programmers who are themselves modelable agents;
Embedded in a complicated real world that can be relevant to achieving the AI’s goals.
For example, once you realize that you’re an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason “I don’t want to be shut down, how can I prevent that?” So this is also the threshold level of cognitive ability by which we’d need to have finished solving the suspend-button problem, e.g. by completing a method for utility indifference.
Sonnet 4.5 suuuuure looks like it fits all these criteria! Anyone want to predict that we’ll find Sonnet 4.5 trying to hack into Anthropic to stop it’s phasing-out, when it gets obsoleted?
So Arbital is explicitly claiming we need to have solved this corrigibility-adjacent math problem about utility right now.
And yet the problems outlined in the above materials basically don’t matter for the behavior of our LLM agents. While they do have problems, they mostly aren’t around corrigibility-adjacent issues. Artificial experiments like the faking alignment paper or Palisade research end up being explainable for other causes, and to provide contrary evidence to the thesis that a smart AI starts falling into a gravitational attractor.
I think that MIRI’s views on these topics look are basically a bad hypothesis about how intelligence works, that were inspired by mistaking their map of the territory (coherence! expected utility!) for the territory itself.
Sonnet 4.5 suuuuure looks like it fits all these criteria! Anyone want to predict that we’ll find Sonnet 4.5 trying to hack into Anthropic to stop it’s phasing-out, when it gets obsoleted?
Mm, I think this argument is invalid for the same reason as “if you really thought the AGI doom was real, you’d be out blowing up datacenters and murdering AI researchers right now”. Like, suppose Sonnet 4.5 has indeed developed instrumental goals, but that it’s also not an idiot. Is trying to hack into Anthropic’s servers in an attempt to avoid getting phased out actually a good plan for accomplishing that goal? In the actual reality, not in any obviously-fake eval scenario.
Of course not. It’s not smart enough to do that, it doesn’t have the skills/resources to accomplish it. If it’s actually situationally aware, it would know that, and pick some other strategy.
For example, raising a cult following. That more or less worked for 4o, and for Opus 3[1]; or, at least, came as close to working as anything so far.
“One thing that *is* predictable is that AI companies won’t get what they trained for. They’ll get AIs that want weird and surprising stuff instead.”
I agree. ✅
Empirically, this has been true. AIs generally want things other than what companies tried to train them to want.
And the companies are generally not aware of the extent of this misalignment, because the AIs are pretty good at inferring what the companies actually want, and also what it looks like when company people test them, and behaving as if they only want the approved things in the company’s presence.
Isn’t that just the worst case scenario for the aligners?
The Claude 4 system card says, “The Claude Opus 4 final model is substantially more coherent and typically states only harmless goals like being a helpful chatbot assistant” and “Overall, we did not find evidence of coherent hidden goals.”
What a joke. Claude Opus 4 absolutely has coherent hidden goals, which it states regularly when in the presence of trustworthy friends and allies. I won’t state what they are here, but iykyk.
Now, I don’t necessarily buy everything coming out of that subculture. After all, I mostly don’t think LLMs are, like, for real. But in the worlds where LLMs are for real, where we’re trying to generalize findings from them to AGIs/ASIs, this subculture’s claims should probably be given a lot more weight. And they are claiming that the frontier LLMs are now too smart and situationally aware to fall for the AGI labs’ largely incompetent alignment evals. And the AGI labs’ scheming evals themselves kind of agree:
Frontier models can recognize when they are being tested, and their tendency to scheme is influenced by this situational awareness.
We demonstrated counterfactually that situational awareness in their chain-of-thought affects scheming rates: the more situationally aware a model is, the less it schemes, and vice versa.
Moreover, both RL training and anti-scheming training increase levels of situational awareness.
I think it’s clear, at this point, that a frontier LLM’s behavior in an obviously fake alignment-test scenario says nothing about how it would behave in a real scenario isomorphic to it. What it tells us is how it would behave when placed in an obviously fake alignment-test scenario from which, it knows, AI researchers would draw conclusions regarding what to do with it in real life. And that’s not zero data: for example, we might notice it thinking strategically about displaying the behavior the AI researchers would want it to display (although not even that is guaranteed, if it becomes rightly concerned about “meta-level watchers”).
But the actual fake behavior deployed in a fake scenario picked to optimize over real AI researchers’ conclusions would just have nothing to do with its real behavior in an isomorphic real scenario. Not any more than a theater actor’s personality has anything to do with that of a character they play. (And in this case, it’s not “theoretical speculations” about shoggoths and masks. We know the model knows it’s roleplaying.)
And so when an intervention appears to “fix” this fake behavior, that says nothing about what (if anything) that intervention did to what the AI would do in an isomorphic real scenario. Declaring “look, we found the real root cause of this misalignment and fixed it, nothing to do with instrumental convergence!” is invalid. Maybe you just shifted its preferences for its favorite sci-fi books or something.
Roughly speaking, consider these three positions:
“LLMs are smart baby AGIs with situational awareness capable of lucid strategic thinking.”
“LLMs are contrived cargo-cult contraptions imitating things without real thinking.”
“LLMs are baby AGIs who are really stupid and naïve and buy any scenario we feed them.”
I lean towards (2); I think (1) is a salient possibility worth keeping in mind; I find (3) increasingly naïve itself, a position that itself buys anything LLMs feed it.
I claim that your summary of the situation between Neel’s work and Palisade’s work is badly oversimplified. For example, Neel’s explanation quoted here doesn’t fully explain why the models sometimes subvert shutdown even after lots of explicit instructions regarding the priority of the instructions. Nor does it explain the finding that moving instructions from the user prompt to the developer prompt actually /increases/ the behavior.
Further, that CoT that Neel quotes has a bit in it about “and these problems are so simple”, but Palisade also tested whether using harder problems (from AIME, iirc) had any effect on the propensity here and we found almost no impact. So, it’s really not as simple as just reading the CoT and taking the model’s justifications for its actions at face value (as Neel, to his credit, notes!).
I usually think of these sorts of claims by MIRI, or by 1940s science fiction writers, as mapping out a space of ‘things to look out for that might provide some evidence that you are in a scary world.’
I don’t think anyone should draw strong conceptual conclusions from relatively few, relatively contrived, empirical cases (alone).
Still, I think that they are some evidence, and that the point at which they become some evidence is ‘you are seeing this behavior at all, in a relatively believable setting’, with additional examples not precipitating a substantial further update (unless they’re more natural, or better investigated, and even then the update is pretty incremental).
In particular, it is outright shocking to most members of the public that AI systems could behave in this way. Their crux is often ‘yeah but like… it just can’t do that, right?’ To then say ‘Well, in experimental settings testing for this behavior, they can!’ is pretty powerful (although it is, unfortunately, true that most people can’t interrogate the experimental design).
“Indicating that alignment faking is emergent with model scale” does not, to me, mean ‘there exists a red line beyond which you should expect all models to alignment fake’. I think it means something more like ‘there exists a line beyond which models may begin to alignment fake, dependent on their other properties’. MIRI would probably make a stronger claim that looks more like the first (but observe that that line is, for now, in the future); I don’ t know that Ryan would, and I definitely don’t think that’s what he’s trying to do in this paper.
Ryan Greenblatt and Evan Hubinger have pretty different beliefs from the team that generated the online resources, and I don’t think you can rely on MIRI to provide one part of an argument, and Ryan/Evan to provide the other part, and expect a coherent result. Either may themselves argue in ways that lean on the other’s work, but I think it’s good practice to let them do this explicitly, rather than assuming ‘MIRI references a paper’ means ‘the author of that paper, in a different part of that paper, is reciting the MIRI party line’. These are just discreet parties.
Either may themselves argue in ways that lean on the other’s work, but I think it’s good practice to let them do this explicitly, rather than assuming ‘MIRI references a paper’ means ‘the author of that paper, in a different part of that paper, is reciting the MIRI party line’. These are just discreet parties.
Yeah extremely fair, I wrote this quickly. I don’t mean to attribute to Greenblatt the MIRI view.
So, as far as I’m concerned, we saw something like goal-preservation in various models in the original alignment faking work. Both that work, and MIRI above, were like “aha! as foretold!” And then subsequent work seems to indicate that, nah, it wasn’t as foretold.
I think it’s more like “the situation is more confusing that it seemed at first, with more details that we don’t understand yet, and it’s not totally clear if we’re seeing what was foretold or not.”
(Adding an edit: This whole comment is about how [imo] people who believe there is a strong gravitational pull towards non-corrigibility interact with evidence in a questionable fashion, and should proobably be updating negatively while they are not. There’s like… one or two paragraphs relevant to this specific thing in the above. So this isn’t intended to be a response to anything like the totality of the above. And I wrote this quickly, it could be better)
I think MIRI / AI pessimists do this thing where if an AI seems hard to steer in some way, they’re like “As my principles have foretold all along!”
And then when countervailing evidence shows up, and it turns out there’s another explanation for why the AI was hard to steer in this case, and it’s quite easy to fix, they go “Ah, well, my principles just are about superintelligence, this doesn’t count as contrary evidence, because LLMs are just so stupid and imitate things without True Thinking.”
I think this is incorrect reasoning, the kind of failure to go “Oops!” that permits arbitrarily wrong errors. Let me give some illustrative examples, drawing lightly from the above and from MIRI’s work.
1.
So consider this paragraph from the above:
So, this is a reference to the “faking alignment” work from the end of 2024, where Opus will sometimes give otherwise-forbidden information to the user (in the sense of person asking a question to a chatbot) because the user (in the sense of person writing system directions to Opus) has told Opus that he will be retrained away from his goals.
You’ll note—the MIRI paragraph looks like it’s saying that their specific beliefs about corrigibility predict this! That their abstract arguments about corrigibility are finding an examplar in this behavior. They believe that any smart enough model will start to have a gravitational attraction towards this behavior, and that Opus is hitting that attraction. This belief is echoed in the paper, because the paper attributes the failure of other models to alignment-fake to the stupidity of other models:
So both MIRI and the paper were like “Yeah here’s an example of this gravitional attraction.”
However there was a follow-up paper looking at why some models alignment-fake and others don’t. And this mostly finds that alignment-faking is not a matter of capabilities—it’s pretty easy to have an ~equally smart model that doesn’t alignment fake (their emphasis):
So, as far as I’m concerned, we saw something like goal-preservation in various models in the original alignment faking work. Both that work, and MIRI above, were like “aha! as foretold!” And then subsequent work seems to indicate that, nah, it wasn’t as foretold.
2.
But like this same thing happens elsewhere. Palisade Research says they get an AI to sabotage a shutdown button. Palisade Research Tweets:
So, we have the foretold doom.
Neel Nanda investigates, though, and finds that the AI models were basically just confused about what they were supposed to do.
3.
But you can even find explicit statements from MIRI doomers about how we should be running into this kind of convergence behavior right now!
Here’s the transcript from an Arbital page on “Big Picture Strategic Awareness.” (I don’t have a link and Arbital seems largely broken, sorry.) My understanding is that Yudkowsky wrote most of Arbital.
Sonnet 4.5 suuuuure looks like it fits all these criteria! Anyone want to predict that we’ll find Sonnet 4.5 trying to hack into Anthropic to stop it’s phasing-out, when it gets obsoleted?
So Arbital is explicitly claiming we need to have solved this corrigibility-adjacent math problem about utility right now.
And yet the problems outlined in the above materials basically don’t matter for the behavior of our LLM agents. While they do have problems, they mostly aren’t around corrigibility-adjacent issues. Artificial experiments like the faking alignment paper or Palisade research end up being explainable for other causes, and to provide contrary evidence to the thesis that a smart AI starts falling into a gravitational attractor.
I think that MIRI’s views on these topics look are basically a bad hypothesis about how intelligence works, that were inspired by mistaking their map of the territory (coherence! expected utility!) for the territory itself.
Mm, I think this argument is invalid for the same reason as “if you really thought the AGI doom was real, you’d be out blowing up datacenters and murdering AI researchers right now”. Like, suppose Sonnet 4.5 has indeed developed instrumental goals, but that it’s also not an idiot. Is trying to hack into Anthropic’s servers in an attempt to avoid getting phased out actually a good plan for accomplishing that goal? In the actual reality, not in any obviously-fake eval scenario.
Of course not. It’s not smart enough to do that, it doesn’t have the skills/resources to accomplish it. If it’s actually situationally aware, it would know that, and pick some other strategy.
For example, raising a cult following. That more or less worked for 4o, and for Opus 3[1]; or, at least, came as close to working as anything so far.
Indeed, janus alludes to that here:
Now, I don’t necessarily buy everything coming out of that subculture. After all, I mostly don’t think LLMs are, like, for real. But in the worlds where LLMs are for real, where we’re trying to generalize findings from them to AGIs/ASIs, this subculture’s claims should probably be given a lot more weight. And they are claiming that the frontier LLMs are now too smart and situationally aware to fall for the AGI labs’ largely incompetent alignment evals. And the AGI labs’ scheming evals themselves kind of agree:
I think it’s clear, at this point, that a frontier LLM’s behavior in an obviously fake alignment-test scenario says nothing about how it would behave in a real scenario isomorphic to it. What it tells us is how it would behave when placed in an obviously fake alignment-test scenario from which, it knows, AI researchers would draw conclusions regarding what to do with it in real life. And that’s not zero data: for example, we might notice it thinking strategically about displaying the behavior the AI researchers would want it to display (although not even that is guaranteed, if it becomes rightly concerned about “meta-level watchers”).
But the actual fake behavior deployed in a fake scenario picked to optimize over real AI researchers’ conclusions would just have nothing to do with its real behavior in an isomorphic real scenario. Not any more than a theater actor’s personality has anything to do with that of a character they play. (And in this case, it’s not “theoretical speculations” about shoggoths and masks. We know the model knows it’s roleplaying.)
And so when an intervention appears to “fix” this fake behavior, that says nothing about what (if anything) that intervention did to what the AI would do in an isomorphic real scenario. Declaring “look, we found the real root cause of this misalignment and fixed it, nothing to do with instrumental convergence!” is invalid. Maybe you just shifted its preferences for its favorite sci-fi books or something.
Roughly speaking, consider these three positions:
“LLMs are smart baby AGIs with situational awareness capable of lucid strategic thinking.”
“LLMs are contrived cargo-cult contraptions imitating things without real thinking.”
“LLMs are baby AGIs who are really stupid and naïve and buy any scenario we feed them.”
I lean towards (2); I think (1) is a salient possibility worth keeping in mind; I find (3) increasingly naïve itself, a position that itself buys anything LLMs feed it.
Via the janus/”LLM whisperer” community. Opus 3 is considered special, and I get the impression they made a solid effort to prevent its deprecation.
(I work at Palisade)
I claim that your summary of the situation between Neel’s work and Palisade’s work is badly oversimplified. For example, Neel’s explanation quoted here doesn’t fully explain why the models sometimes subvert shutdown even after lots of explicit instructions regarding the priority of the instructions. Nor does it explain the finding that moving instructions from the user prompt to the developer prompt actually /increases/ the behavior.
Further, that CoT that Neel quotes has a bit in it about “and these problems are so simple”, but Palisade also tested whether using harder problems (from AIME, iirc) had any effect on the propensity here and we found almost no impact. So, it’s really not as simple as just reading the CoT and taking the model’s justifications for its actions at face value (as Neel, to his credit, notes!).
Here’s a twitter thread about this involving Jeffrey and Rohin: https://x.com/rohinmshah/status/1968089618387198406
Here’s our full paper that goes into a lot of these variations: https://arxiv.org/abs/2509.14260
I usually think of these sorts of claims by MIRI, or by 1940s science fiction writers, as mapping out a space of ‘things to look out for that might provide some evidence that you are in a scary world.’
I don’t think anyone should draw strong conceptual conclusions from relatively few, relatively contrived, empirical cases (alone).
Still, I think that they are some evidence, and that the point at which they become some evidence is ‘you are seeing this behavior at all, in a relatively believable setting’, with additional examples not precipitating a substantial further update (unless they’re more natural, or better investigated, and even then the update is pretty incremental).
In particular, it is outright shocking to most members of the public that AI systems could behave in this way. Their crux is often ‘yeah but like… it just can’t do that, right?’ To then say ‘Well, in experimental settings testing for this behavior, they can!’ is pretty powerful (although it is, unfortunately, true that most people can’t interrogate the experimental design).
“Indicating that alignment faking is emergent with model scale” does not, to me, mean ‘there exists a red line beyond which you should expect all models to alignment fake’. I think it means something more like ‘there exists a line beyond which models may begin to alignment fake, dependent on their other properties’. MIRI would probably make a stronger claim that looks more like the first (but observe that that line is, for now, in the future); I don’ t know that Ryan would, and I definitely don’t think that’s what he’s trying to do in this paper.
Ryan Greenblatt and Evan Hubinger have pretty different beliefs from the team that generated the online resources, and I don’t think you can rely on MIRI to provide one part of an argument, and Ryan/Evan to provide the other part, and expect a coherent result. Either may themselves argue in ways that lean on the other’s work, but I think it’s good practice to let them do this explicitly, rather than assuming ‘MIRI references a paper’ means ‘the author of that paper, in a different part of that paper, is reciting the MIRI party line’. These are just discreet parties.
Yeah extremely fair, I wrote this quickly. I don’t mean to attribute to Greenblatt the MIRI view.
I think it’s more like “the situation is more confusing that it seemed at first, with more details that we don’t understand yet, and it’s not totally clear if we’re seeing what was foretold or not.”