A first look at the hard problem of corrigibility

Summary: We would like to build corrigible AIs, which do not prevent us from shutting them down or changing their utility function. While there are some corrigibility solutions (such as utility indifference) that appear to partially work, they do not capture the philosophical intuition behind corrigibility: we want an agent that not only allows us to shut it down, but also desires for us to be able to shut it down if we want to. In this post, we look at a few models of utility function uncertainty and find that they do not solve the corrigibility problem.


Introduction

Eliezer describes the hard problem of corrigibility on Arbital:

On a human, intuitive level, it seems like there’s a central idea behind corrigibility that seems simple to us: understand that you’re flawed, that your meta-processes might also be flawed, and that there’s another cognitive system over there (the programmer) that’s less flawed, so you should let that cognitive system correct you even if that doesn’t seem like the first-order right thing to do. You shouldn’t disassemble that other cognitive system to update your model in a Bayesian fashion on all possible information that other cognitive system contains; you shouldn’t model how that other cognitive system might optimally correct you and then carry out the correction yourself; you should just let that other cognitive system modify you, without attempting to manipulate how it modifies you to be a better form of ‘correction’.

Formalizing the hard problem of corrigibility seems like it might be a problem that is hard (hence the name). Preliminary research might talk about some obvious ways that we could model A as believing that B has some form of information that A’s preference framework designates as important, and showing what these algorithms actually do and how they fail to solve the hard problem of corrigibility.

The objective of this post is to be some of the preliminary research described in the second paragraph.

Setup

We will assume that the AI exists in the same world as the human. We will examine various models the AI could use for the human and the true utility function. None of these models will truly yield a corrigible agent.

1. The human is a logically omniscient Bayesian utility maximizer who knows their utility function

a) The human is a black box

i) The human is aware of the AI

If the AI models the human as a black-box Bayesian utility maximizer who knows about the AI, then it can expect the human to communicate their utility function efficiently. This leads to a signalling equilibrium in which the human communicates the correct utility function to the AI using an optimal code, e.g. by writing it down as a computer program.

Of course, in real life this will not work, because the human will be unable to write down their utility function.
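To make the demand concrete, here is a minimal toy sketch (my own illustration, not part of the original setup) of what the 1ai equilibrium asks of the human: hand the AI an executable specification of their entire utility function, which the AI then simply optimizes.

```python
# Toy sketch (illustrative only): the "signal" in model 1ai is a complete,
# executable utility function handed over by the human.

def human_reported_utility(world_state: dict) -> float:
    """The program the human would have to write. In reality no human can
    exhaustively specify their values like this."""
    return (
        2.0 * world_state.get("people_flourishing", 0.0)
        - 5.0 * world_state.get("people_suffering", 0.0)
        # ... and every other term of human values, spelled out in full
    )

def ai_choose(actions, predict_outcome):
    """Having received the program, the AI just maximizes it."""
    return max(actions, key=lambda a: human_reported_utility(predict_outcome(a)))
```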

ii) The human is not aware of the AI

If the human is not aware of the AI, then the AI must learn the human's values by observing the human interacting with the world, rather than through signalling. Since this model assumes the human is perfectly rational, it is very close to the value learning (revealed-preference) models used in economics. However, these models are inappropriate for corrigibility, because corrigibility requires the human to interact with the AI (e.g. by shutting it down). Additionally, the AI will want to manipulate the human into being an efficient morality sensor; for example, it may set up trolley problems for the human to encounter. This will not yield the right answer unless the value learning model is correct (which it isn't, because humans are not logically omniscient Bayesian utility maximizers).
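As a minimal sketch of the kind of inference this model licenses (the names and candidate utilities below are hypothetical, not from the post): the AI treats each observed human choice as perfectly revealing and performs an exact Bayesian update over candidate utility functions.

```python
# Toy sketch: Bayesian value learning from the choices of a perfectly
# rational human (model 1aii). Candidate utilities are illustrative.

CANDIDATES = {
    "u_apples":  {"apple": 1.0, "banana": 0.0},
    "u_bananas": {"apple": 0.0, "banana": 1.0},
}

def rational_choice(utility, options):
    """A logically omniscient utility maximizer always picks an argmax."""
    return max(options, key=lambda a: utility[a])

def update_posterior(prior, observed_choice, options):
    """A candidate keeps its weight only if it predicts the observed choice."""
    unnormalized = {
        name: prior[name] * (1.0 if rational_choice(u, options) == observed_choice else 0.0)
        for name, u in CANDIDATES.items()
    }
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()} if total else prior

posterior = {"u_apples": 0.5, "u_bananas": 0.5}
posterior = update_posterior(posterior, observed_choice="apple",
                             options=["apple", "banana"])
print(posterior)  # {'u_apples': 1.0, 'u_bananas': 0.0}
```

Note that nothing in this update penalizes the AI for engineering the options the human faces, which is exactly the manipulation failure described above, and the update is only sound if the human really is the ideal maximizer the model assumes.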

b) The human is not a black box

Here, the AI can possibly gain information about the human's utility function faster than in the signalling equilibrium, by taking apart the human's brain (literally or metaphorically). This will give a sufficiently powerful AI enough information to predict the human's actions in many different possible situations. Therefore, the AI will not need to make any further observations of the human. We expect this to be bad, because it requires the AI's value learning algorithm to be correct from the start. Certainly, this does not count as corrigible behavior!

2. The human is a logically omniscient Bayesian utility maximizer who is uncertain about their utility function

Bayesian uncertainty about the utility function, with no way of learning more about it, does not change much. If there is no way to learn about one's utility function, then (depending on one's approach to normative uncertainty) the optimal policy is to optimize a weighted average of the candidate utility functions, or something similar. So this situation is effectively the same as the case with a known utility function.
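To spell out the reduction under the simplest approach (maximizing expected utility over the candidates; the notation here is mine, not the post's): a fixed, unupdatable distribution $p$ over candidate utility functions $U_i$ collapses into a single known utility function.

$$\arg\max_a \; \mathbb{E}_{U \sim p}[U(a)] \;=\; \arg\max_a \sum_i p_i \, U_i(a) \;=\; \arg\max_a \bar{U}(a), \qquad \text{where } \bar{U} := \sum_i p_i \, U_i.$$

An agent with this kind of frozen uncertainty therefore behaves exactly like an agent whose known utility function is $\bar{U}$.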

3. The human is a logically omniscient Bayesian utility maximizer who is uncertain about their utility function and observes it over time

a) The human is a black box

i) The human is aware of the AI

As in 1ai, we get a signalling equilibrium. In this idealized model, instead of communicating the full utility function at the start, the human communicates observations of the true utility function (i.e. moral intuitions) over time. So the AI will keep the human alive and use them as a morality sensor. Similar to 1ai, this fails in real life because it requires the human to explicitly specify their moral intuitions.
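A minimal variant of the earlier sketch (again with hypothetical names) shows the structure of this model: the human's reports arrive as a stream of pairwise moral judgements, and the AI's estimate only improves for as long as the stream keeps flowing.

```python
# Toy sketch of model 3ai: the human streams moral intuitions over time,
# modelled here as pairwise judgements "x is at least as good as y".

CANDIDATES = {
    "u_apples":  {"apple": 1.0, "banana": 0.0},
    "u_bananas": {"apple": 0.0, "banana": 1.0},
}

def consistent(utility, judgement):
    better, worse = judgement
    return utility[better] >= utility[worse]

def update(posterior, judgement):
    """Drop candidates that contradict the reported intuition, then renormalize."""
    unnormalized = {name: w * (1.0 if consistent(CANDIDATES[name], judgement) else 0.0)
                    for name, w in posterior.items()}
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()} if total else posterior

posterior = {"u_apples": 0.5, "u_bananas": 0.5}
for judgement in [("apple", "banana")]:      # the human's stream of moral intuitions
    posterior = update(posterior, judgement)
print(posterior)  # the estimate sharpens only while the human keeps reporting
```

This is why the AI in this model keeps the human alive: the human is an instrumentally useful sensor, not someone whose ability to shut the AI down is itself valued.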

ii) The human is not aware of the AI

As in 1aii, the AI will learn about the human's values from the human's interaction with the world. This differs from 1aii in that the AI cannot assume the human makes consistent decisions over time (because the human learns more about the utility function as time passes). In practice, however, it is similar: the AI will manipulate the human into being an optimal morality sensor, only now its manipulation will account for the fact that the human receives new moral information over time.

b) The human is not a black box

As in 1b, the AI might more efficiently gain information about the utility function by taking apart the human’s brain. Then, it can predict the human’s actions in possible future situations. This includes predictions about what moral intuitions the human would communicate. Similar to 1b, this fails in real life because it requires the AI’s value learning algorithm to be correct.

4. One of the above, but the human also has uncertainty about mathematical statements

In this case the human solves the problem as before, except that they delegate questions of mathematical uncertainty to the AI. For example, the human might write out their true utility function as a mathematical expression containing difficult-to-compute quantities. This requires the AI to implement a solution to logical uncertainty, but even if we already had such a solution, this would still place unrealistic demands on the human (namely, that they reduce the value alignment problem to a purely mathematical one).
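As a hypothetical illustration (not from the post) of what model 4 asks of the human: the utility function is handed over as mathematics that is complete but not evaluable by the human, with the hard terms left for the AI's assumed solution to logical uncertainty.

```python
# Hypothetical illustration of model 4: a mathematically complete utility
# function containing a term the human cannot compute.

def hard_constant() -> float:
    """A precisely defined but practically incomputable number. The human
    cannot evaluate it; estimating it is delegated to the AI's assumed
    logical-uncertainty machinery."""
    raise NotImplementedError("to be estimated by the AI under logical uncertainty")

def human_specified_utility(world_state: dict) -> float:
    # Mathematically well-defined, but only because the human has already
    # reduced value alignment to a purely mathematical problem.
    return hard_constant() * world_state.get("flourishing", 0.0)
```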

Conclusion

This short overview of corrigibility models shows that simple uncertainty about the correct utility function is not sufficient for corrigibility. It is not clear what the correct solution to the hard problem of corrigibility is. Perhaps it will involve some model like those in this post in which humans are bounded in a specific way that causes them to desire corrigible AIs, or perhaps it will look completely different from these models.