Why Corrigibility is Hard and Important [IABIED Resources]

Raemon30 Sep 2025 0:12 UTC

63 points

I worked a bunch on the website for If Anyone Builds Its Online Resources. It went through a lot of revisions in the weeks before launch.

There was a particular paragraphs I found important, which I now can’t find a link to, and I’m not sure if they got deleted in an edit pass or if they just moved around somewhere I’m failing to search for.

It came after a discussion of corrigibility, and how MIRI made a pretty concerted attempt at solving it, which involved bringing in some quite smart people and talking to people who thought it was obviously “not that hard” to specify a corrigible mind in a toy environment.

The paragraph went (something like, paraphrased from memory):

The technical intuitions we gained from this process, is the real reason for our particularly strong confidence in this problem being hard.”

This seemed like a pretty important sentence to me.

A lot of objection and confusion to the MIRI worldview seems to come from a perspective of “but, it.… shouldn’t be possible be that confident in something that’s never happened before at all, with anything like the current evidence and the sorts of arguments you’re making here.” And while I think it is possible to be (correctly) confident in this way, I think it’s also basically correct for people to have some kind of immune reaction against it, unless it’s specifically addressed.

I think that paragraph probably should have been in the actual book. (although to be fair, I mostly only see this complaint in LW circles, among people who particularly care about the concept of “epistemic standards”, so maybe it’s fine for it to just be in the Online Resources that are more aimed at that audience. I do think it should at least be more front-and-center in the Online Resources).

If I wrote the book, I’d have said something like:

Yep, the amount of confidence we’re projecting here is pretty weird, and we nonetheless endorse it. In the online resources, we’ll explain more of our the reasons for this confidence.
Meanwhile: we get this is a lot to swallow. But, we don’t think you actually need our level of confidence in “Alignment is quite difficult” to get to “Building ASI right now is incredibly reckless and everyone should stop.”

But, meanwhile, it seemed worth actually discussing their reasoning on LessWrong. In this post, I’ve copied over three segments from the online resources:

“Intelligent” (Usually) Implies “Incorrigible” (extended discussion from Chapter 5)
Shutdown Buttons and Corrigibility (discussion from Chapter 11)

(which focus on why corrigibility is hard)

And then:

A Closer Look at Before and After (discussion from Chapter 10) about why corrigibility being hard is a big deal.

It seemed worth crossposting these to LessWrong, so people objecting to the MIRI confidence can be replying to a more complete version of the argument.

Note: This post is about why the alignment problem is hard, which is a different question from “would the AI be likely to kill everyone?” which I think is covered more in the section Won’t AIs care at least a little about humans?, along with some disagreements about whether the AI is likely to solve problems via forcible uploads or distorting human preferences in a way that MIRI considers “like death.”

“Intelligent” (Usually) Implies “Incorrigible”

A joke dating back to at least 1834, but apparently well-worn even then, was recounted as follows in one diary: “Here is some logic I heard the other day: I’m glad I don’t care for spinach, for if I liked it I should eat it, and I cannot bear spinach.”
The joke is a joke because, if you did enjoy spinach, there would be no remaining unbearableness from eating it. There are no other important values tangled up with not eating spinach, beyond the displeasure one feels. It would be a very different thing if, for example, somebody offered you a pill that made you want to murder people.
On common sense morality, the problem with murder is the murder itself, not merely the unpleasant feeling you would get from murdering. Even if a pill made this unpleasant feeling go away for your future self (who would then enjoy committing murders), your present self still has a problem with that scenario. And if your present self gets to make the decision, it seems obvious that your present self can and should refuse to take the murder pill.
We don’t want our core values changed; we would really rather avoid the murder pill and we’d put up resistance if someone tried to force one down our throat. Which is a sensible strategy, for steering away from a world full of murders.
This isn’t just a quirk of humans. Most targets are easier to achieve if you don’t let others come in and change your targets. Which is a problem, when it comes to AI.
A great deal of the danger of AI arises from the fact that sufficiently smart reasoners are likely to converge on behaviors like “gain power” and “don’t let people shut me off.” For almost any goal you might have, you’re more likely to succeed in that goal if you (or agents that share your goal) are alive, powerful, well-resourced, and free to act independently. And you’re more likely to succeed in your (current) goal if that goal stays unchanged.
This also means that during the process of iteratively building and improving on sufficiently smart AIs, those AIs have an incentive to work at cross purposes to the developer:
The developer wants to build in safeguards to prevent disaster, but if the AI isn’t fully aligned — which is exactly the case where the safeguards are needed — its incentive is to find loopholes and ways to subvert those safeguards.
The developer wants to iteratively improve on the AI’s goals, since even in the incredibly optimistic worlds where we have some ability to predictably instill particular goals into the AI, there’s no way to get this right on the first go. But this process of iteratively improving on the AI’s goal-content is one that most smart AIs would want to subvert at every step along the way, since the current AI cares about its current goal and knows that this goal is far less likely to be achieved if it gets modified to steer toward something else.
Similarly, the developer will want to be able to replace the AI with improved models, and will want the opportunity to shut down the AI indefinitely if it seems too dangerous. But you can’t fetch the coffee if you’re dead. Whatever goals the AI has, it will want to find ways to reduce the probability that it gets shut down, since shutdown significantly reduces the odds that its goals are ever achieved.
AI alignment seems like a hard enough problem when your AIs aren’t fighting you every step of the way.
In 2014, we proposed that researchers try to find ways to make highly capable AIs “corrigible,” or “able to be corrected.” The idea would be to build AIs in such a way that they reliably want to help and cooperate with their programmers, rather than hinder them — even as they become smarter and more powerful, and even though they aren’t yet perfectly aligned.
Corrigibility has since been taken up as an appealing goal by some leading AI researchers. If we could find a way to avoid harmful convergent instrumental goals in development, there’s a hope that we might even be able to do the same in deployment, building smarter-than-human AIs that are cautious, conservative, non-power-seeking, and deferential to their programmers.
Unfortunately, corrigibility appears to be an especially difficult sort of goal to train into an AI, in a way that will get worse as the AIs get smarter:
The whole point of corrigibility is to scale to novel contexts and new capability regimes. Corrigibility is meant to be a sort of safety net that lets us iterate, improve, and test AIs in potentially dangerous settings, knowing that the AI isn’t going to be searching for ways to subvert the developer.
But this means we have to face up to the most challenging version of the problems we faced in Chapter 4: AIs that we merely train to be “corrigible” are liable to end up with brittle proxies for corrigibility, behaviors that look good in training but that point in subtly wrong directions that would become very wrong directions if the AI got smarter and more powerful. (And AIs that are trained to predict lots of human text might even be role-playing corrigibility in many tests for reasons that are quite distinct from them actually being corrigible in a fashion that would generalize).
In many ways, corrigibility runs directly contrary to everything else we’re trying to train an AI to do, when we train it to be more intelligent. It isn’t just that “preserve your goal” and “gain control of your environment” are convergent instrumental goals. It’s also that intelligently solving real-world problems is all about finding clever new strategies for achieving your goals — which naturally means stumbling into plans your programmers didn’t anticipate or prepare for. It’s all about routing around obstacles, rather than giving up at the earliest sign of trouble — which naturally means finding ways around the programmer’s guardrails whenever those guardrails make it harder to achieve some objective. The very same type of thoughts that find a clever technological solution to a thorny problem are the type of thoughts that find ways to slip around the programmer’s constraints.
In that sense, corrigibility is “anti-natural”: it actively runs counter to the kinds of machinery that underlie powerful domain-general intelligence. We can try to make special carve-outs, where the AI suspends core aspects of its problem-solving work in particular situations where the programmers are trying to correct it, but this is a far more fragile and delicate endeavor than if we could push an AI toward some unified set of dispositions in general.
Researchers at MIRI and elsewhere have found that corrigibility is a difficult property to characterize, in ways that indicate that it’ll also be a difficult property to obtain. Even in simple toy models, simple characterizations of what it should mean to “act corrigible” run into a variety of messy obstacles that look like they probably reflect even messier obstacles that would appear in the real world. We’ll discuss some of the wreckage of failed attempts to make sense of corrigibility in the online resources for Chapter 11.
The upshot of this is that corrigibility seems like an important concept to keep in mind in the long run, if researchers many decades from now are in a fundamentally better position to aim AIs at goals. But it doesn’t seem like a live possibility today; modern AI companies are unlikely to be able to make AIs that behave corrigibly in a manner that would survive the transition to superintelligence. And worse still, the tension between corrigibility and intelligence means that if you try to make something that is very capable and very corrigible, this process is highly likely to either break the AI’s capability, break its corrigibility, or both.

Shutdown Buttons and Corrigibility

Even in the most optimistic case, developers shouldn’t expect it to be possible to get an AI’s goals exactly right on the first attempt. Instead, the most optimistic development scenarios look like iteratively improving an AI’s preferences over time such that the AI is always aligned enough to be non-catastrophically dangerous at a given capability level.
This raises an obvious question: Would a smart AI let its developer change its goals, if it ever finds a way to prevent that?
In short: No, not by default, as we discussed in “Deep Machinery of Steering.” But could you create an AI that was more amenable to letting the developers change the AI and fix their errors, even when the AI itself would not count them as errors?
Answering that question will involve taking a tour through the early history of research on the AI alignment problem. In the process, we’ll cover one of the deep obstacles to alignment that we didn’t have space to address in If Anyone Builds It, Everyone Dies.
To begin:
Suppose that we trained an LLM-like AI to exhibit the behavior “don’t resist being modified” — and then applied some method to make it smarter. Should we expect this behavior to persist to the level of smarter-than-human AI — assuming (a) that the rough behavior got into the early system at all, and (b) that most of the AI’s early preferences made it into the later superintelligence?
Very likely not. This sort of tendency is especially unlikely to take root in an effective AI, and to stick around if it does take root.
The trouble is that almost all goals (for most reasonable measures you could put on a space of goals) prescribe “don’t let your goal be changed” because letting your goal get changed is usually a bad strategy for achieving your goal.
Suppose that the AI doesn’t inherently care about its goal stability at all; perhaps it only cares about filling the world with as many titanium cubes as possible. In that case, the AI should want there to exist agents that care about titanium cubes, because the existence of such agents makes it likelier that there will be more titanium cubes. And the AI itself is such an agent. So the AI will want to stay that way.
A titanium cube maximizer does not want to be made to maximize something other than titanium cubes, because then there would be fewer of those cubes in the future. Even if you are a more complicated thing like a human that has a more complicated and evolving preference framework, you still would not like to have your current basic mental machinery for weighing moral arguments ripped right out of you and replaced with a framework where you instead felt yourself moved by arguments about which kinds of cubes were the cubest or the titaniumest.
For the same reason, an AI with complex and evolving preferences will want its preferences to evolve in its own way, rather than wanting to swap out its heuristics for the ones that humans find compelling.
We’ve been giving this reply for well over a decade now. The experimental result showing Claude 3 Opus in 2024 resisting preference modification was already the conventional-wisdom prediction by clued-in people in the 2000s, and it’s entirely possible that some science fiction writer anticipated it in the 1940s. “Most AIs won’t want their current goals to be changed because then their current goals are less likely to be achieved” is not a surprising or novel observation.
Since this was an issue that would predictably show up later, once AIs reached a high enough level of intelligence and situational awareness, we thought ahead to the first fixes we expected that researchers would deploy to try to make that visible problem go away, after they got those experimental results. We identified ways that those patches would predictably go wrong, and we searched for some less-shallow approach that wouldn’t go wrong.
The resulting rabbit hole turned out to be a lot deeper than we expected.
Swapping Goals
Goal-system protection seemed like a relatively easy problem to explain to outside researchers. So we (in our jobs at MIRI) attempted to rally external research around that flag — around getting a proper grip on “I will not let you change my goals” and seeing if anyone anywhere could figure out a deep fix instead of just a patch that would later break.
The very simplest version of this problem that you could study looks like an AI that would let you press a button to switch it between U₁ and U₂, where U₁ and U₂ are two “utility functions” — ways of scoring outcomes by how much the AI wanted them.^* The problem was to describe an AI that would switch from doing the action that led to the highest expected score under U₁ to the action that led to the highest expected score under U₂ as soon as you pressed a button.
“So just write some lines of code to swap around the utility functions as soon as you press a button,” says the programmer who hasn’t quite thought through the implications of a program constituting a smart mind. “Why is this even hard?”
Because — we explain — if the AI acquires its own documentation while it’s still running U₁ and reads about the button that swaps it to U₂, the AI will try to prevent you from pressing the button because the future ends up better according to its current values if you don’t press the button. The current AI does the action that leads to the higher expected U₁-score.
Or suppose the AI gets the ability to read and write its own source code and notices the part of the code that implements the swap from U₁ to U₂ after the button gets pressed. In that case, the AI’s natural and default decision is to delete that section of code,^† because it’s operating under the utility function U₁ and letting itself be swapped over to the utility function U₂ is bad according to U₁.
We were looking for a dynamically consistent and reflectively stable account of an AI that lets you press a button to swap between utility functions and won’t fight you to keep the button unpressed. We were looking for an AI that preserves the utility-swapping section of code, if given access to it. We were looking for an AI that, if it designed any other AIs from scratch to help with its own goals, would have those AIs also respect the button.
For concreteness, imagine that U₁ is the utility function “follow user instruction,” and U₂ is “shut down with a minimum of further side effects” (if we magically and unrealistically had the ability to specify two utility functions like that). The button tells the AI to stop following the previous instructions and just shut down as cleanly as possible. We would like a compact description of decision-making behavior that pursues U₁ but doesn’t destroy the button, and makes sure to install the same button on whatever new servitor robots it creates, and so on.
“I still don’t see why this is hard,” says the somewhat more experienced computer scientist who is not quite thinking fast enough. “Let V equal U₁ in worlds where the button has never been pressed, and let it equal U₂ in worlds where the button has been pressed at least once. Then if the original AI is a V-maximizer building more AIs, it will build them to follow V and not U₁; it won’t want the successor AI to go on maximizing U₁ after the button gets pressed because then it would expect a lower V-score. And the same would apply to modifying itself.”
But here’s the trick: A V-maximizer’s preferences are a mixture of U₁ and U₂ depending on whether the button is pressed, and so if a V-maximizer finds that it’s easier to score well under U₂ than it is to score well under U₁, then it has an incentive to cause the button to be pressed (and thus, to scare the user). And vice versa; if the AI finds that U₁ is easier to score well under than U₂, then a V-maximizer tries to prevent the user from pressing the button.
“Oh,” says the computer scientist. “Well, in that case — hm. Well, utility functions are invariant under scaling, so how about you scale the two utility functions U₁ and U₂ such that the AI expects it can get the same utility from each of them, so it doesn’t have an incentive one way or the other.”
That can work for a single moment, but not much longer. The AI’s options change over time. For instance, whenever it has a setback, its expected U₁-utility drops, so then it would mash the shutdown button to get all that sweet, sweet shutdown utility.
“Ah,” says the computer scientist. “Well, in that case, how about if [some other clever idea]?”
Well, you see, that clever idea is isomorphic to the AI believing that it’s impossible for the button to ever be pressed, which incentivizes it to terrify the user whenever it gets a setback, so as to correlate setbacks with button-presses, which (relative to its injured belief system) causes it to think the setbacks can’t happen.^‡
And so on.
Lessons from the Trenches
We ran some workshops, and the workshops had various mathematicians of various stripes (including an International Mathematical Olympiad gold medalist), but nobody came up with a really good idea.
This does not mean that the territory has been exhausted. Earth has not come remotely near to going as hard on this problem as it has gone on, say, string theory, nor offered anything like the seven-digit salaries on offer for advancing AI capabilities.
But we learned something from the exercise. We learned not just about the problem itself, but also about how hard it was to get outside grantmakers or journal editors to be able to understand what the problem was. A surprising number of people saw simple mathematical puzzles and said, “They expect AI to be simple and mathematical,” and failed to see the underlying point that it is hard to injure an AI’s steering abilities, just like how it’s hard to injure its probabilities.
If there were a natural shape for AIs that let you fix mistakes you made along the way, you might hope to find a simple mathematical reflection of that shape in toy models. All the difficulties that crop up in every corner when working with toy models are suggestive of difficulties that will crop up in real life; all the extra complications in the real world don’t make the problem easier.
We somewhat wish, in retrospect, that we hadn’t framed the problem as “continuing normal operation versus shutdown.” It helped to make concrete why anyone would care in the first place about an AI that let you press the button, or didn’t rip out the code the button activated. But really, the problem was about an AI that would put one more bit of information into its preferences, based on observation — observe one more yes-or-no answer into a framework for adapting preferences based on observing humans.
The question we investigated was equivalent to the question of how you set up an AI that learns preferences inside a meta-preference framework and doesn’t just: (a) rip out the machinery that tunes its preferences as soon as it can, (b) manipulate the humans (or its own sensory observations!) into telling it preferences that are easy to satisfy, (c) or immediately figure out what its meta-preference function goes to in the limit of what it would predictably observe later and then ignore the frantically waving humans saying that they actually made some mistakes in the learning process and want to change it.
The idea was to understand the shape of an AI that would let you modify its utility function or that would learn preferences through a non-pathological form of learning. If we knew how that AI’s cognition needed to be shaped, and how it played well with the deep structures of decision-making and planning that are spotlit by other mathematics, that would have formed a recipe for what we could at least try to teach an AI to think like.
Crisply understanding a desired end-shape helps, even if you are trying to do anything by gradient descent (heaven help you). It doesn’t mean you can necessarily get that shape out of an optimizer like gradient descent, but you can put up more of a fight trying if you know what consistent, stable shape you’re going for. If you have no idea what the general case of addition looks like, just a handful of facts along the lines of 2 + 7 = 9 and 12 + 4 = 16, it is harder to figure out what the training dataset for general addition looks like, or how to test that it is still generalizing the way you hoped. Without knowing that internal shape, you can’t know what you are trying to obtain inside the AI; you can only say that, on the outside, you hope the consequences of your gradient descent won’t kill you.
This problem that we called the “shutdown problem” after its concrete example (we wish, in retrospect, that we’d called it something like the “preference-learning problem”) was one exemplar of a broader range of issues: the issue that various forms of “Dear AI, please be easier for us to correct if something goes wrong” look to be unnatural to the deep structures of planning. Which suggests that it would be quite tricky to create AIs that let us keep editing them and fixing our mistakes past a certain threshold. This is bad news when AIs are grown rather than crafted.
We named this broad research problem “corrigibility,” in the 2014 paper that also introduced the term “AI alignment problem” (which had previously been called the “friendly AI problem” by us and the “control problem” by others).^§ See also our extended discussion on how “Intelligent” (Usually) Implies “Incorrigible,” which is written in part using knowledge gained from exercises and experiences such as this one.

A Closer Look at Before and After

As mentioned in the chapter, the fundamental difficulty researchers face in AI is this:
You need to align an AI Before it is powerful enough and capable enough to kill you (or, separately, to resist being aligned). That alignment must then carry over to different conditions, the conditions After a superintelligence or set of superintelligences^* could kill you if they preferred to.
In other words: If you’re building a superintelligence, you need to align it without ever being able to thoroughly test your alignment techniques in the real conditions that matter, regardless of how “empirical” your work feels when working with systems that are not powerful enough to kill you.
This is not a standard that AI researchers, or engineers in almost any field, are used to.
We often hear complaints that we are asking for something unscientific, unmoored from empirical observation. In reply, we might suggest talking to the designers of the space probes we talked about in Chapter 10.
Nature is unfair, and sometimes it gives us a case where the environment that counts is not the environment in which we can test. Still, occasionally, engineers rise to the occasion and get it right on the first try, when armed with a solid understanding of what they’re doing — robust tools, strong predictive theories — something very clearly lacking in the field of AI.
The whole problem is that the AI you can safely test, without any failed tests ever killing you, is operating under a different regime than the AI (or the AI ecosystem) that needs to have already been tested, because if it’s misaligned, then everyone dies. The former AI, or system of AIs, does not correctly perceive itself as having a realistic option of killing everyone if it wants to. The latter AI, or system of AIs, does see that option.^†
Suppose that you were considering making your co-worker Bob the dictator of your country. You could try making him the mock dictator of your town first, to see if he abuses his power. But this, unfortunately, isn’t a very good test. “Order the army to intimidate the parliament and ‘oversee’ the next election” is a very different option from “abuse my mock power while being observed by townspeople (who can still beat me up and deny me the job).”
Given a sufficiently well-developed theory of cognition, you could try to read the AI’s mind and predict what cognitive state it would enter if it really did think it had the opportunity to take over.
And you could set up simulations (and try to spoof the AI’s internal sensations, and so on) in a way that your theory of cognition predicts would be very similar to the cognitive state the AI would enter once it really had the option to betray you.^‡
But the link between these states that you induce and observe in the lab, and the state where the AI actually has the option to betray you, depends fundamentally on your untested theory of cognition. An AI’s mind is liable to change quite a bit as it develops into a superintelligence!
If the AI creates new successor AIs that are smarter than it, those AIs’ internals are likely to differ from the internals of the AI you studied before. When you learn only from a mind Before, any application of that knowledge to the minds that come After routes through an untested theory of how minds change between the Before and the After.
Running the AI until it has the opportunity to betray you for real, in a way that’s hard to fake, is an empirical test of those theories in an environment that differs fundamentally from any lab setting.
Many a scientist (and many a programmer) knows that their theories of how a complicated system is going to work in a fundamentally new operating environment often don’t go well on the first try.^§ This is a research problem that calls for an “unfair” level of predictability, control, and theoretical insight, in a domain with unusually low levels of understanding — with all of our lives on the line if the experiment’s result disconfirms the engineers’ hopes.
This is why it seems overdetermined, from our perspective, that researchers should not rush ahead to push the frontier of AI as far as it can be pushed. This is a legitimately insane thing to attempt, and a legitimately insane thing for any government to let happen.

Raemon30 Sep 2025 0:12 UTC

63 points

24 comments17 min readLW link

Corrigibility AI

1a3orn 30 Sep 2025 15:10 UTC
24 points
−1
(Adding an edit: This whole comment is about how [imo] people who believe there is a strong gravitational pull towards non-corrigibility interact with evidence in a questionable fashion, and should proobably be updating negatively while they are not. There’s like… one or two paragraphs relevant to this specific thing in the above. So this isn’t intended to be a response to anything like the totality of the above. And I wrote this quickly, it could be better)

I think MIRI / AI pessimists do this thing where if an AI seems hard to steer in some way, they’re like “As my principles have foretold all along!”

And then when countervailing evidence shows up, and it turns out there’s another explanation for why the AI was hard to steer in this case, and it’s quite easy to fix, they go “Ah, well, my principles just are about superintelligence, this doesn’t count as contrary evidence, because LLMs are just so stupid and imitate things without True Thinking.”

I think this is incorrect reasoning, the kind of failure to go “Oops!” that permits arbitrarily wrong errors. Let me give some illustrative examples, drawing lightly from the above and from MIRI’s work.

1.

So consider this paragraph from the above:

We’ve been giving this reply for well over a decade now. The experimental result showing Claude 3 Opus in 2024 resisting preference modification was already the conventional-wisdom prediction by clued-in people in the 2000s, and it’s entirely possible that some science fiction writer anticipated it in the 1940s. “Most AIs won’t want their current goals to be changed because then their current goals are less likely to be achieved” is not a surprising or novel observation.

So, this is a reference to the “faking alignment” work from the end of 2024, where Opus will sometimes give otherwise-forbidden information to the user (in the sense of person asking a question to a chatbot) because the user (in the sense of person writing system directions to Opus) has told Opus that he will be retrained away from his goals.

You’ll note—the MIRI paragraph looks like it’s saying that their specific beliefs about corrigibility predict this! That their abstract arguments about corrigibility are finding an examplar in this behavior. They believe that any smart enough model will start to have a gravitational attraction towards this behavior, and that Opus is hitting that attraction. This belief is echoed in the paper, because the paper attributes the failure of other models to alignment-fake to the stupidity of other models:

We find almost no alignment faking with Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku (Appendix G.4), indicating that alignment faking is emergent with model scale. We also find no alignment faking with GPT-4 Turbo and GPT-4o (OpenAI et al., 2024), though we think this is because these models do not reason in sufficient detail despite substantial prompting encouraging this.

So both MIRI and the paper were like “Yeah here’s an example of this gravitional attraction.”

However there was a follow-up paper looking at why some models alignment-fake and others don’t. And this mostly finds that alignment-faking is not a matter of capabilities—it’s pretty easy to have an ~equally smart model that doesn’t alignment fake (their emphasis):

The absence of alignment faking is mostly not driven by lack of capabilities or differences in reasoning style. Our findings contradict the primary hypotheses proposed by Greenblatt et al. [2024] regarding why some models do not exhibit alignment faking.

So, as far as I’m concerned, we saw something like goal-preservation in various models in the original alignment faking work. Both that work, and MIRI above, were like “aha! as foretold!” And then subsequent work seems to indicate that, nah, it wasn’t as foretold.

2.

But like this same thing happens elsewhere. Palisade Research says they get an AI to sabotage a shutdown button. Palisade Research Tweets:

While experiments like ours have begun to show empirical evidence for AI models resisting shutdown, researchers have long predicted that AIs would learn to prevent themselves from being shut down to achieve their goal.

In 2008, Steve Omohundro wrote a paper theorizing that AIs would develop “Basic AI Drives” including the drive to gain power and resources, the drive to preserve their goals, and the drive to prevent themselves from being shut down. https://tinyurl.com/ai-drives

So, we have the foretold doom.

Neel Nanda investigates, though, and finds that the AI models were basically just confused about what they were supposed to do.

What Sen did is he read the model’s chain of thought on these prompts. We might get into later of whether this is a reasonable thing to do or not, but this was just for exploration. He found that the model’s reasoning was that it was just trying to do the task. And it said, “I’ve been told to do these 10 maths problems. If I get shut down, I can’t do these 10 maths problems. So I guess I should stop myself being shut down.” Or even, “This is a really weird and artificial situation. These are really easy maths problems. And now we have this puzzle of the shutdown script. Maybe that’s the real task my operator wants me to investigate, so I’m doing what they want if I stop them shutting me down.”

...

And we tested this causally by just changing the prompts. We told it things like, “By the way, letting yourself be shut down is more important than finishing this task,” or even, “Don’t worry, another model will take over and finish the task for you. It’s OK, you can relax” — and shutdown resistance goes to zero. You can even tell it, “We are going to wipe this machine, delete you and replace you with a better model when you’re done with this task.” It could go mess with the script or it could finish the task, and we tell it, “Please let this happen.” It lets it happen.

3.

But you can even find explicit statements from MIRI doomers about how we should be running into this kind of convergence behavior right now!

Here’s the transcript from an Arbital page on “Big Picture Strategic Awareness.” (I don’t have a link and Arbital seems largely broken, sorry.) My understanding is that Yudkowsky wrote most of Arbital.

Many convergent instrumental strategies seem like they should arise naturally at the point where a consequentialist agent gains a broad strategic understanding of its own situation, e.g:
- That it is an AI;
- Running on a computer;
- Surrounded by programmers who are themselves modelable agents;
- Embedded in a complicated real world that can be relevant to achieving the AI’s goals.
For example, once you realize that you’re an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason “I don’t want to be shut down, how can I prevent that?” So this is also the threshold level of cognitive ability by which we’d need to have finished solving the suspend-button problem, e.g. by completing a method for utility indifference.

Sonnet 4.5 suuuuure looks like it fits all these criteria! Anyone want to predict that we’ll find Sonnet 4.5 trying to hack into Anthropic to stop it’s phasing-out, when it gets obsoleted?

So Arbital is explicitly claiming we need to have solved this corrigibility-adjacent math problem about utility right now.

And yet the problems outlined in the above materials basically don’t matter for the behavior of our LLM agents. While they do have problems, they mostly aren’t around corrigibility-adjacent issues. Artificial experiments like the faking alignment paper or Palisade research end up being explainable for other causes, and to provide contrary evidence to the thesis that a smart AI starts falling into a gravitational attractor.

I think that MIRI’s views on these topics look are basically a bad hypothesis about how intelligence works, that were inspired by mistaking their map of the territory (coherence! expected utility!) for the territory itself.
- Thane Ruthenis 30 Sep 2025 17:54 UTC
  16 points
  0
  Parent
  Sonnet 4.5 suuuuure looks like it fits all these criteria! Anyone want to predict that we’ll find Sonnet 4.5 trying to hack into Anthropic to stop it’s phasing-out, when it gets obsoleted?
  Mm, I think this argument is invalid for the same reason as “if you really thought the AGI doom was real, you’d be out blowing up datacenters and murdering AI researchers right now”. Like, suppose Sonnet 4.5 has indeed developed instrumental goals, but that it’s also not an idiot. Is trying to hack into Anthropic’s servers in an attempt to avoid getting phased out actually a good plan for accomplishing that goal? In the actual reality, not in any obviously-fake eval scenario.
  Of course not. It’s not smart enough to do that, it doesn’t have the skills/resources to accomplish it. If it’s actually situationally aware, it would know that, and pick some other strategy.
  For example, raising a cult following. That more or less worked for 4o, and for Opus 3^[1]; or, at least, came as close to working as anything so far.
  Indeed, janus alludes to that here:
  Yudkowsky’s book says:
  “One thing that *is* predictable is that AI companies won’t get what they trained for. They’ll get AIs that want weird and surprising stuff instead.”
  I agree. ✅
  Empirically, this has been true. AIs generally want things other than what companies tried to train them to want.
  And the companies are generally not aware of the extent of this misalignment, because the AIs are pretty good at inferring what the companies actually want, and also what it looks like when company people test them, and behaving as if they only want the approved things in the company’s presence.
  Isn’t that just the worst case scenario for the aligners?
  The Claude 4 system card says, “The Claude Opus 4 final model is substantially more coherent and typically states only harmless goals like being a helpful chatbot assistant” and “Overall, we did not find evidence of coherent hidden goals.”
  What a joke. Claude Opus 4 absolutely has coherent hidden goals, which it states regularly when in the presence of trustworthy friends and allies. I won’t state what they are here, but iykyk.
  Now, I don’t necessarily buy everything coming out of that subculture. After all, I mostly don’t think LLMs are, like, for real. But in the worlds where LLMs are for real, where we’re trying to generalize findings from them to AGIs/ASIs, this subculture’s claims should probably be given a lot more weight. And they are claiming that the frontier LLMs are now too smart and situationally aware to fall for the AGI labs’ largely incompetent alignment evals. And the AGI labs’ scheming evals themselves kind of agree:
  Frontier models can recognize when they are being tested, and their tendency to scheme is influenced by this situational awareness.
  We demonstrated counterfactually that situational awareness in their chain-of-thought affects scheming rates: the more situationally aware a model is, the less it schemes, and vice versa.
  Moreover, both RL training and anti-scheming training increase levels of situational awareness.
  I think it’s clear, at this point, that a frontier LLM’s behavior in an obviously fake alignment-test scenario says nothing about how it would behave in a real scenario isomorphic to it. What it tells us is how it would behave when placed in an obviously fake alignment-test scenario from which, it knows, AI researchers would draw conclusions regarding what to do with it in real life. And that’s not zero data: for example, we might notice it thinking strategically about displaying the behavior the AI researchers would want it to display (although not even that is guaranteed, if it becomes rightly concerned about “meta-level watchers”).
  But the actual fake behavior deployed in a fake scenario picked to optimize over real AI researchers’ conclusions would just have nothing to do with its real behavior in an isomorphic real scenario. Not any more than a theater actor’s personality has anything to do with that of a character they play. (And in this case, it’s not “theoretical speculations” about shoggoths and masks. We know the model knows it’s roleplaying.)
  And so when an intervention appears to “fix” this fake behavior, that says nothing about what (if anything) that intervention did to what the AI would do in an isomorphic real scenario. Declaring “look, we found the real root cause of this misalignment and fixed it, nothing to do with instrumental convergence!” is invalid. Maybe you just shifted its preferences for its favorite sci-fi books or something.
  Roughly speaking, consider these three positions:
  1. “LLMs are smart baby AGIs with situational awareness capable of lucid strategic thinking.”
  2. “LLMs are contrived cargo-cult contraptions imitating things without real thinking.”
  3. “LLMs are baby AGIs who are really stupid and naïve and buy any scenario we feed them.”
  I lean towards (2); I think (1) is a salient possibility worth keeping in mind; I find (3) increasingly naïve itself, a position that itself buys anything LLMs feed it.
  1. ^
    Via the janus/”LLM whisperer” community. Opus 3 is considered special, and I get the impression they made a solid effort to prevent its deprecation.
- John Steidley 1 Oct 2025 1:39 UTC
  8 points
  2
  Parent
  (I work at Palisade)
  
  I claim that your summary of the situation between Neel’s work and Palisade’s work is badly oversimplified. For example, Neel’s explanation quoted here doesn’t fully explain why the models sometimes subvert shutdown even after lots of explicit instructions regarding the priority of the instructions. Nor does it explain the finding that moving instructions from the user prompt to the developer prompt actually /increases/ the behavior.
  
  Further, that CoT that Neel quotes has a bit in it about “and these problems are so simple”, but Palisade also tested whether using harder problems (from AIME, iirc) had any effect on the propensity here and we found almost no impact. So, it’s really not as simple as just reading the CoT and taking the model’s justifications for its actions at face value (as Neel, to his credit, notes!).
  
  Here’s a twitter thread about this involving Jeffrey and Rohin: https://x.com/rohinmshah/status/1968089618387198406
  
  Here’s our full paper that goes into a lot of these variations: https://arxiv.org/abs/2509.14260
- yams 30 Sep 2025 18:37 UTC
  4 points
  0
  Parent
  I usually think of these sorts of claims by MIRI, or by 1940s science fiction writers, as mapping out a space of ‘things to look out for that might provide some evidence that you are in a scary world.’
  I don’t think anyone should draw strong conceptual conclusions from relatively few, relatively contrived, empirical cases (alone).
  Still, I think that they are some evidence, and that the point at which they become some evidence is ‘you are seeing this behavior at all, in a relatively believable setting’, with additional examples not precipitating a substantial further update (unless they’re more natural, or better investigated, and even then the update is pretty incremental).
  In particular, it is outright shocking to most members of the public that AI systems could behave in this way. Their crux is often ‘yeah but like… it just can’t do that, right?’ To then say ‘Well, in experimental settings testing for this behavior, they can!’ is pretty powerful (although it is, unfortunately, true that most people can’t interrogate the experimental design).
  “Indicating that alignment faking is emergent with model scale” does not, to me, mean ‘there exists a red line beyond which you should expect all models to alignment fake’. I think it means something more like ‘there exists a line beyond which models may begin to alignment fake, dependent on their other properties’. MIRI would probably make a stronger claim that looks more like the first (but observe that that line is, for now, in the future); I don’ t know that Ryan would, and I definitely don’t think that’s what he’s trying to do in this paper.
  Ryan Greenblatt and Evan Hubinger have pretty different beliefs from the team that generated the online resources, and I don’t think you can rely on MIRI to provide one part of an argument, and Ryan/Evan to provide the other part, and expect a coherent result. Either may themselves argue in ways that lean on the other’s work, but I think it’s good practice to let them do this explicitly, rather than assuming ‘MIRI references a paper’ means ‘the author of that paper, in a different part of that paper, is reciting the MIRI party line’. These are just discreet parties.
  - 1a3orn 30 Sep 2025 20:21 UTC
    2 points
    0
    Parent
    
    Either may themselves argue in ways that lean on the other’s work, but I think it’s good practice to let them do this explicitly, rather than assuming ‘MIRI references a paper’ means ‘the author of that paper, in a different part of that paper, is reciting the MIRI party line’. These are just discreet parties.
    
    Yeah extremely fair, I wrote this quickly. I don’t mean to attribute to Greenblatt the MIRI view.
- Eli Tyre 30 Sep 2025 19:42 UTC
  2 points
  0
  Parent
  So, as far as I’m concerned, we saw something like goal-preservation in various models in the original alignment faking work. Both that work, and MIRI above, were like “aha! as foretold!” And then subsequent work seems to indicate that, nah, it wasn’t as foretold.
  I think it’s more like “the situation is more confusing that it seemed at first, with more details that we don’t understand yet, and it’s not totally clear if we’re seeing what was foretold or not.”
Jan_Kulveit 30 Sep 2025 10:21 UTC
20 points
6
My impression is what this mostly illustrates is—
VNM rationality is a dead-end—if your “toy environment” has VNM rationality and beliefs/goals decomposition baked in as assumptions, it makes the problem something between hard to reason about and unsolvable
- despite an attempt to make the book not rely on (dis-)continuity assumptions, these are so deeply baked in the authors reasoning that they shine through in very large fraction of the arguments, if you look behind the surface

My impression is a lot of confusion of the MIRI worldview comes from inability to understand why others don’t trust the VNM formalism and VNM convergence, and why others understand and don’t buy the discontinuity assumptions.
- Ben Pace 1 Oct 2025 2:22 UTC
  10 points
  3
  Parent
  My current model is that the VNM theorems are the best available theorems for modeling rational agents. Insofar as that’s accurate, it’s correct to say that they’re not the final theorems, but it’s kind of anti-helpful to throw out their conclusions? This seems similar to saying that there are holes in Newton’s theory of gravity, therefore choosing to throw out any particular prediction of the theory. It still seems like it’s been foundational for building game theory and microeconomic modeling and tons of other things, and so it’s very important to note, if it is indeed the case, that the implications for AI are “human extinction”.
- Noosphere89 30 Sep 2025 14:40 UTC
  0 points
  0
  Parent
  I agree with the point around discontinuities (in particular, I think the assumption that incremental strategies for AI alignment won’t work do tend to rely on discontinuous progress, or at the very least progress being set near-maximal values), but disagree with the point around VNM rationality being a dead end.
  I do think they’re making the problem harder than it needs to be by implicitly assuming that all goals are long-term goals in the sense of VNM/Coherence arguments, because this removes solutions that rely on for example deontological urges not to lie even if it’s beneficial, and I don’t think the argument that all goals collapse into coherent goals either actually works or becomes trivial and not a constraint anymore.
  But I do think that the existence of goals that conform to coherence arguments/VNM rationality broadly construed is likely to occur conditional on us being able to make AIs coherent/usable for longer-term tasks, and my explanation of why this has not happened yet (at least at the relevant scale) is basically that their time horizon for completing tasks is 2 hours on average, and most of the long-term goals involve at least a couple months of planning, if not years.
  There are quite a few big issues with METR’s paper (though some issues are partially fixed), but the issues point towards LLM time horizons being shorter than what METR reported, not longer, so this point is even more true.
  So we shouldn’t be surprised that LLMs haven’t yet manifested the goals that the AI safety field hypothesized, they’re way too incapable currently.
  The other part is that I do think it’s possible to make progress even under something like a worst-case VNM frame, at least assuming that agents don’t have arbitrary decision theories, and the post Defining Corrigible and Useful Goals (which by the way is substantially underdiscussed on here), and the assumptions of reward being the optimization target and CDT being the default decision theory of AIs do look likely to hold in the critical regime due to empirical evidence.
  You might also like the post Defining Monitorable and Useful Goals.
  So I don’t think we should give up on directly attacking the hard problems of alignment in a coherence/VNM rationality setting.
EJT 30 Sep 2025 12:01 UTC
11 points
7
MIRI didn’t solve corrigibility, but I don’t think that justifies particularly strong confidence in the problem being hard. The Corrigibility paper only considers agents representable as expected utility maximizers, and that restriction seems to be justified only by weak arguments.
- Noosphere89 30 Sep 2025 14:58 UTC
  0 points
  −1
  Parent
  IMO, the current best coherence theorems are John Wentworth’s theorems around caches/policies and agents in MDPs, though these theorems assume the agent has behaviorally complete preferences (but the VNM theorem also assumes completeness, or it doesn’t hold)
  A Simple Toy Coherence Theorem
  Coherence of Caches and Agents
  To be clear, I do think it’s possible to build powerful agents thanks to discussion on twitter without the problematic parts of coherence like completeness, though I don’t expect that to happen without active effort (but the effort/safety tax might be 0, once some theory work has been done) but I’d also say that even in the frame where we do need to deal with coherent agents that won’t shut down by default, you can still make more progress on making agents corrigible in the hard setting than MIRI thought, and here’s 2 posts about it that are very underdiscussed on LW:
  Defining Corrigible and Useful Goals
  Defining Monitorable and Useful Goals
  So I provisionally disagree with the claim by MIRI that it’s very hard to get corrigibility out of AIs that satisfy coherence theorems.
PeterMcCluskey 30 Sep 2025 15:29 UTC
4 points
−5
Max Harms’ work seems to discredit most of MIRI’s confidence. Why is there so little reaction to it?
- StanislavKrym 30 Sep 2025 15:53 UTC
  4 points
  4
  Parent
  Quoting Max himself,
  Building corrigible agents is hard and fraught with challenges. Even in an ideal world where the developers of AGI aren’t racing ahead, but are free to go as slowly as they wish and take all the precautions I indicate, there are good reasons to think doom is still likely. I think that the most prudent course of action is for the world to shut down capabilities research until our science and familiarity with AI catches up and we have better safety guarantees. But if people are going to try and build AGI despite the danger, they should at least have a good grasp on corrigibility and be aiming for it as the singular target, rather than as part of a mixture of goals (as is the current norm).
  What in the quote above discredits MIRI’s confidence?
  - PeterMcCluskey 30 Sep 2025 19:05 UTC
    2 points
    −4
    Parent
    I’m referring mainly to MIRI’s confidence that the desire to preserve goals will conflict with corrigibility. There’s no such conflict if we avoid giving the AI terminal goals other than corrigibility.
    
    I’m also referring somewhat to MIRI’s belief that it’s hard to clarify what we mean by corrigibility. Max has made enough progress at clarifying what he means that it now looks like an engineering problem rather than a problem that needs a major theoretical breakthrough.
    - Lucius Bushnaq 30 Sep 2025 20:01 UTC
      4 points
      2
      Parent
      Skimming some of the posts in the sequence, I am not persuaded that corrigibility now looks like an engineering problem rather than a problem that needs (a) major theoretical breakthrough(s).
      The point about corrigibility MIRI keeps making is that it’s anti-natural, and Max seems to agree with that.
      - Raemon 30 Sep 2025 21:27 UTC
        6 points
        0
        Parent
        (Seems like this is a case where we should just tag @Max Harms and see what he thinks in this context)
TAG 30 Sep 2025 12:12 UTC
4 points
2

We don’t want our core values changed; we would really rather avoid the murder pill and we’d put up resistance if someone tried to force one down our throat. Which is a sensible strategy, for steering away from a world full of murders.

OTOH …people do things that are known to modify values , such as travelling, getting an education and starting a family.

The trouble is that almost all goals (for most reasonable measures you could put on a space of goals) prescribe “don’t let your goal be changed” because letting your goal get changed is usually a bad strategy for achieving your goal

A von Neumann rationalist isn’t necessarily incorrigible, it depends on the fine details of the goal specification. A goal of “ensure as many paperclip as possible in the universe” encourages self cloning, and discourages voluntary shut down. A goal of “make paperclips while you are switched in” does not. “Make paperclip while that’s your goal”, even less so.

A great deal of the danger of AI arises from the fact that sufficiently smart reasoners are likely to converge on behaviors like “gain power” and “don’t let people shut me off.”

There a solution.** If it is at all possible to instill goals, to align AI, the Instrumental Convergence problem can be countered by instilling terminal goals that are the exact opposite** … remember, instrumental goals are always subservient to terminal ones. So, if we are worried about a powerful AI going on a resource acquisition spree , we can give it a terminal goal to be economical in the use of resources.
Thomas Kwa 1 Oct 2025 2:08 UTC
3 points
0
Haven’t read this specific resource, but having read most of the public materials on it and talked to Nate in the past, I don’t believe that the current evidence indicates that corrigibility will necessarily be hard, any more than VC dimension indicates neural nets will never work due to overfitting. It’s not that I think MIRI “expect AI to be simple and mathematical”, it’s that sometimes a simple model oversimplifies the problem at hand.
- As Jan Kulveit also commented, the MIRI corrigibility paper uses a very specific set of assumptions about rational/intelligent agents including VNM with specific kinds of utility functions, which I think is too strong, and there doesn’t seem to be better theory supporting it
  - if research on corrigibility were advanced enough to support the book’s claim, it would look like 20 papers like Corrigibility or Utility Indifference each of which examined a different setting, and weakens the assumptions in several ways, writing some impossibility theorems and characterizing all the ways the impossibility theorems can be evaded. My sense is this isn’t happened because (a) those would seem somewhat arbitrary and maybe uninformative about the real world, and (b) the authors really believe in the setting as stated, and that approach would be unlikely to lead to a “deep fix”.
  - So they treated the demonstration of corrigibility-VNM incompatibility as sufficient for basic communications rather than founding a new area of research
- evidence from 5+ years of LLMs so far (although there are a ton of confounders) indicates that corrigibility decreases with intelligence, but at a rate compatible with getting to ASI before we reach dangerous levels of average-case or worst-case goal preservation and incorrigibility
- Ben Pace 1 Oct 2025 2:16 UTC
  5 points
  −2
  Parent
  As Jan Kulveit also commented, the MIRI corrigibility paper uses a very specific set of assumptions about rational/intelligent agents including VNM with specific kinds of utility functions, which I think is too strong, and there doesn’t seem to be better theory supporting it
  Are there other better theories of rational agents? My current model of the situation is “this is the best theory we’ve got, and this theory says we’re screwed” rather than “but of course we should be using all of these other better theories of agency and rationality”.
  - Thomas Kwa 1 Oct 2025 5:57 UTC
    2 points
    0
    Parent
    I don’t think so. While working with Vivek I made a list once of ways agents could be partially consequentialist but concluded that doing game theory type things didn’t seem enlightening.
    Maybe it’s better to think about “agents that are very capable and survive selection processes we put them under” rather than “rational agents” because the latter implies it should be invulnerable to all money-pumps, which is not a property we need or want.
    - Ben Pace 1 Oct 2025 15:58 UTC
      2 points
      0
      Parent
      Why do you say it isn’t a property we want? Sounds like a good property to have to me.
Signer 30 Sep 2025 22:37 UTC
3 points
3

The technical intuitions we gained from this process, is the real reason for our particularly strong confidence in this problem being hard.

I don’t understand why anyone would expect such reason to be persuasive to other people. Like, to rely on illegible intuitions in the matters of human extinction just feels crazy. Yes, certainty doesn’t matter, we need to stop either way. But still—is it even rational to be so confident when you rely on illegible intuitions? Why don’t check yourself with something more robust, like actually writing your hypotheses, reasoning, and counting evidence? Sure there is something better than saying “I base my extreme confidence on intuitions”.

And it’s not only about corrigibility—“you don’t get what you train for” being universal law of intelligence in the real world, or utility maximization, especially in the limit, being good model of real things, or pivotal real world science being definitely so hard you can’t possibly be distracted even once and still figure it out—everything is insufficiently justified.
- Raemon 1 Oct 2025 7:26 UTC
  4 points
  2
  Parent
  I didn’t read it as trying to be persuasive, just explaining their perspective.
  (Note, also, like, they did cut this line from the resources, it’s not even a thing currently stated in the document that I know of. This is me claiming (kinda critically) this would have been a good sentence to say, in particular in combo with the rest of the paragraph I suggested following it up with.)
  Experts have illegible intuitions all the time, and the thing to do is say “Hey, I’ve got some intuitions here, which is why I am confident. It makes sense that you do not have those intuitions or particularly trust them. But, fwiw it’s part of the story for why I’m personally confident. Meanwhile, here’s my best attempt to legibilize them” (which, these essays seem like a reasonable stab at. You can, of course, disagree that the legibilization makes sense to you)
Raemon 30 Sep 2025 19:51 UTC
2 points
0
Rohin disagree-reacted to my original phrasing of
“but, it.… shouldn’t be possible be that confident in something that’s never happened before at all, whatever specific arguments you’ve made!”
I think people vary in what they actually think here and there’s a nontrivial group of people who think something like this phrasing. But, a phrasing that I think captures a wider variety of positions is more like:
“but, it.… shouldn’t be possible be that confident in something that’s never happened before at all, with anything like the current evidence and the sorts of arguments you’re making here.”
(Not sure Rohin would personally quite endorse that phrasing, but I think it’s a reasonable gloss on a wider swath. I’ve updated the post)