Take 14: Corrigibility isn’t that great.

As a writing exercise, I’m writing an AI Alignment Hot Take Advent Calendar—one new hot take, written some days (the plan was every day) for 25 days.

It’s the end (I saved a tenuous one for ya)! Kind of disappointing that this ended up averaging out to one every 2 days, but this was also a lot of work and I’m happy with the quality level. Some of the drafts that didn’t work as “hot takes” will get published later.

I

There are certainly arguments for why we want to build corrigible AI. For example, the problem of fully updated deference says that if you build an AI that wants things, even if it’s uncertain about what it wants, it knows it can get more of what it wants if it doesn’t let you turn it off.

The mental image this conjures up is of an AI doing something that’s obvious-to-humans bad, us clamoring to stop it, and the AI blocking us from turning it off because we didn’t solve the problem of fully updated deference. It would be better if we built an AI that took things slow, and that would let us shut it off once we looked at what it was doing and saw that it was obviously bad.

Don’t get me wrong, this could be a nice property to have. But I don’t think it’s all that likely to come up, because aiming at aligned AI means building AI that tries not to do obviously bad stuff.

A key point is that corrigibility is only desirable if you actually expect to use it. Its primary sales pitch is that it might give us a mulligan on an AI that starts doing obviously bad stuff. If everything goes great and we wind up in a post-scarcity utopia, I’m not worried about whether the AI would let me turn it off if I counterfactually wanted to.

A world where corrigibility is useful might look like us building an agenty AI with a value learning process that we’re not confident in, letting it run and interacting with it to try to judge how the value learning is going, and then (with moderate probability) turning it off and trying again with another idea for value learning. What does corrigibility have to do in this world? The AI shouldn’t deliberately try to get shut down by doing obviously-bad things, but it also shouldn’t try to avoid being shut down by instrumentally hiding bad behavior, or by backing itself up on AWS.

Such indifference to the outside world is the default for limited AI that doesn’t model that part of the world, or doesn’t make decisions in a very coherent way. But in an agent that’s good at navigating the real world, a lot of corrigibility is made out of value learning. The AI probably has to actively notice when it’s coming into conflict with humans (and specifically humans, rather than head lice) and defer to them, even if those humans want to shut down the AI or rewrite its value learning process.

So the first issue: if you can already do things like noticing when you’re coming into conflict with humans, I fully expect you can build an AI that tries not to do things the humans think are obviously bad. And even though this has dangers, notably that an AI which avoids doing obviously-bad things makes corrigibility less likely to ever get used, what the hell are you trying to do value learning for if you’re not going to use it to get the AI to do good things and not bad things?

II

Second issue: sometimes agenty properties are good. An incorrigible AI is one that endorses some value learning process or meta-process, and will defend that good process against random noise, and against humans who might try to modify the process selfishly or short-sightedly.

The point of corrigibility is that the AI should not trust its own judgement about what counts as “short-sighted” for the human, and should let itself be shut down or modified. But sometimes humans are like a toddler in a self-driving car, and you don’t want the car to listen when they press the emergency stop button. And more vaguely, I don’t want corrigibility’s unnaturalness to leak out and interfere with a super-powerful AI protecting what it finds good.

Maybe there’s a fine line we can tread here, where some parameter for how much the AI protects its goals changes as we gain trust in the AI’s reasoning process. But it seems plausible that corrigibility creates more problems than it solves once we’re pretty confident in the value learning process.
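To make that “parameter” slightly more concrete, here’s a toy sketch (purely illustrative; the trust dial and the stand-in utilities are hypothetical names I’m making up, not a real proposal): decisions get weighted between deferring to shutdown and protecting the AI’s current goals, with the weight set by how much we trust the value learning so far.

```python
# Toy illustration (hypothetical): a single "trust" dial that blends
# deferring to human shutdown requests with protecting the AI's current goals.

def combined_utility(action, trust, goal_utility, defer_utility):
    """Blend goal-directed utility with deference utility.

    trust: float in [0, 1]; 0 = no trust (pure deference),
           1 = full trust (pure goal pursuit).
    goal_utility, defer_utility: functions mapping an action to a float.
    """
    assert 0.0 <= trust <= 1.0
    return trust * goal_utility(action) + (1.0 - trust) * defer_utility(action)


# Stand-in utilities for two possible actions:
goal_u = {"keep_running": 1.0, "allow_shutdown": 0.0}
defer_u = {"keep_running": 0.0, "allow_shutdown": 1.0}

for trust in (0.1, 0.9):
    best = max(goal_u, key=lambda a: combined_utility(a, trust, goal_u.get, defer_u.get))
    print(f"trust={trust}: prefers {best}")
```

With low trust the blend favors allowing shutdown; with high trust it favors goal protection, which is exactly the part the rest of this section is nervous about.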

I’m not saying we can’t test things. If we want to test an AI’s value learning process without the risks of creating an adversarial agent, the safest way is to not create an agent at all—just directly test a generative world-model, or a plan generator that’s not hooked up to anything, or what have you. In many ways, this is corrigibility, just an extreme form that makes the AI useless for deployment.

When we actually build any superintelligent agent, I’d rather that we just have a value learning process that we trust. One that not only doesn’t do obviously bad things, but goes so far as to not do obviously bad meta-level reasoning either. It’s been speculated that a superintelligent AI would reinvent corrigibility so it could give it to its successor AIs. I bet a superintelligent AI would just solve value learning instead.