A first look at the hard problem of corrigibility
Summary: We would like to build corrigible AIs, which do not prevent us from shutting them down or changing their utility function. While there are some corrigibility solutions (such as utility indifference) that appear to partially work, they do not capture the philosophical intuition behind corrigibility: we want an agent that not only allows us to shut it down, but also desires for us to be able to shut it down if we want to. In this post, we look at a few models of utility function uncertainty and find that they do not solve the corrigibility problem.
Introduction
Eliezer describes the hard problem of corrigibility on Arbital:
On a human, intuitive level, it seems like there’s a central idea behind corrigibility that seems simple to us: understand that you’re flawed, that your meta-processes might also be flawed, and that there’s another cognitive system over there (the programmer) that’s less flawed, so you should let that cognitive system correct you even if that doesn’t seem like the first-order right thing to do. You shouldn’t disassemble that other cognitive system to update your model in a Bayesian fashion on all possible information that other cognitive system contains; you shouldn’t model how that other cognitive system might optimally correct you and then carry out the correction yourself; you should just let that other cognitive system modify you, without attempting to manipulate how it modifies you to be a better form of ‘correction’.
Formalizing the hard problem of corrigibility seems like it might be a problem that is hard (hence the name). Preliminary research might talk about some obvious ways that we could model A as believing that B has some form of information that A’s preference framework designates as important, and showing what these algorithms actually do and how they fail to solve the hard problem of corrigibility.
The objective of this post is to be some of the preliminary research described in the second paragraph.
Setup
We will assume that the AI exists in the same world as the human. We will examine various models the AI could use for the human and the true utility function. None of these models will truly yield a corrigible agent.
1. The human is a logically omniscient Bayesian utility maximizer who knows their utility function
a) The human is a black box
i) The human is aware of the AI
If the AI models the human as a black box Bayesian utility maximizer who knows about the AI, then it can assume that the human will communicate their utility function to the AI efficiently. This leads to a signalling equilibrium in which the human communicates the correct utility function to the AI using an optimal code. So the AI will expect the human to communicate the utility function, e.g. by writing it out as a computer program.
Of course, in real life this will not work, because the human is unable to write out their utility function.
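To make the idealized assumption concrete, here is a minimal sketch of what "writing the utility function as a computer program" would mean. The outcome features and weights are entirely invented for illustration; the point of the section is precisely that no human can actually produce such a program for their real values.

```python
# Hypothetical utility function written out as a program, as the
# idealized model demands. The features and weights are made up;
# a real human utility function admits no such short description.
def human_utility(world_state):
    return (
        2.0 * world_state["flourishing"]
        - 5.0 * world_state["suffering"]
        + 0.1 * world_state["art"]
    )

u = human_utility({"flourishing": 1.0, "suffering": 0.1, "art": 3.0})
# u is approximately 1.8
```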
ii) The human is not aware of the AI
If the human is not aware of the AI, then the AI must learn the human's values by observing the human interacting with the world, rather than through signalling. Since this model assumes the human is perfectly rational, it is very close to the value learning models used in economics. However, these models are inappropriate for corrigibility, because corrigibility requires the human to interact with the AI (e.g. by shutting it down). Additionally, the AI will want to manipulate the human into being an efficient morality sensor; for example, it may set up trolley problems for the human to encounter. This will not yield the right answer unless the value learning model is correct (which it isn't, because humans are not logically omniscient Bayesian utility maximizers).
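As a minimal sketch of this style of value learning, consider a toy world with three candidate utility functions (all values invented for illustration) and a human assumed to be a perfect maximizer. Observing a single choice can already rule candidates out:

```python
# Toy value learning: infer which candidate utility function the human
# is maximizing from an observed choice. Outcomes and utilities are
# invented for illustration.
candidate_utilities = {
    "u1": {"apple": 1.0, "cake": 0.0, "salad": 0.5},
    "u2": {"apple": 0.0, "cake": 1.0, "salad": 0.2},
    "u3": {"apple": 0.6, "cake": 0.4, "salad": 1.0},
}
prior = {name: 1 / 3 for name in candidate_utilities}

def update(posterior, options, choice):
    # A perfectly rational human picks an option of maximal utility, so
    # any candidate under which `choice` is suboptimal gets probability 0.
    new = {}
    for name, p in posterior.items():
        u = candidate_utilities[name]
        best = max(u[o] for o in options)
        new[name] = p if u[choice] >= best else 0.0
    total = sum(new.values())
    return {k: v / total for k, v in new.items()}

posterior = update(prior, ["apple", "cake"], "cake")
# Only u2 rates cake at least as highly as apple, so posterior["u2"] == 1.0
```

Note that the inference is only as good as the model: if the human is not actually a perfect maximizer, the hard zeroing-out above draws confidently wrong conclusions, which is exactly the failure the paragraph describes.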
b) The human is not a black box
Here, the AI can possibly gain information about the human's utility function faster than in the signalling equilibrium, by taking apart the human's brain (literally or metaphorically). This will give a sufficiently powerful AI enough information to predict the human's actions in many different possible situations. Therefore, the AI will not need further observations of the human. We expect this to be bad, because it requires the AI's value learning algorithm to be correct from the start. Certainly, this does not count as corrigible behavior!
2. The human is a logically omniscient Bayesian utility maximizer who is uncertain about their utility function
Bayesian uncertainty about the utility function, with no way of learning about the utility function, will not change much. If there is no way to learn about one's utility function, then (depending on one's approach to normative uncertainty) the optimal action is to optimize a weighted average of the candidate utility functions, or something similar. So this situation is really the same as the case with a known utility function.
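Concretely, under this approach the agent just maximizes the credence-weighted mixture of its candidate utility functions, and the mixture behaves exactly like a single fixed utility function. A sketch with made-up credences and values:

```python
# Credence-weighted mixture of candidate utility functions over actions.
# With no way to learn which candidate is correct, the mixture is itself
# just one fixed utility function.
candidates = [
    (0.5, {"a": 1.0, "b": 0.0}),
    (0.3, {"a": 0.0, "b": 1.0}),
    (0.2, {"a": 0.5, "b": 0.5}),
]

def mixture_utility(action):
    return sum(credence * u[action] for credence, u in candidates)

best = max(["a", "b"], key=mixture_utility)
# mixture_utility("a") = 0.6 and mixture_utility("b") = 0.4, so best == "a"
```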
3. The human is a logically omniscient Bayesian utility maximizer who is uncertain about their utility function and observes it over time
a) The human is a black box
i) The human is aware of the AI
As in 1ai, we get a signalling equilibrium. In this idealized model, instead of communicating the full utility function at the start, the human will communicate observations of the true utility function (i.e. moral intuitions) over time. So the AI will keep the human alive and use them as a morality sensor. Similar to 1ai, this fails in real life because it requires the human to specify their moral intuitions.
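A minimal sketch of the "morality sensor" dynamic, under the toy assumption (ours, not the post's) that moral intuitions arrive as noisy numeric readings of the true utility which the AI simply averages:

```python
from collections import defaultdict

# The AI keeps the human around and averages their reported moral
# intuitions, treated here as noisy observations of the true utility.
class MoralitySensor:
    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def report(self, outcome, observed_value):
        # One moral intuition: a noisy reading of u(outcome).
        self.totals[outcome] += observed_value
        self.counts[outcome] += 1

    def estimate(self, outcome):
        return self.totals[outcome] / self.counts[outcome]

sensor = MoralitySensor()
for reading in [0.9, 1.1, 1.0]:  # intuitions gathered over time
    sensor.report("honesty", reading)
# sensor.estimate("honesty") is approximately 1.0
```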
ii) The human is not aware of the AI
As in 1aii, the AI will learn about the human's values from the human's interaction with the world. This is different from 1aii in that the AI cannot assume that the human makes consistent decisions over time (because the human learns more about the utility function over time). However, in practice it is similar: the AI will manipulate the human into being an optimal morality sensor, only it will do this differently to account for the fact that the human gains moral updates over time.
b) The human is not a black box
As in 1b, the AI might more efficiently gain information about the utility function by taking apart the human's brain. Then, it can predict the human's actions in possible future situations. This includes predictions about what moral intuitions the human would communicate. Similar to 1b, this fails in real life because it requires the AI's value learning algorithm to be correct.
4. One of the above, but the human also has uncertainty about mathematical statements
In this case the human solves the problem as before, except that they delegate mathematical uncertainty questions to the AI. For example, the human might write out their true utility function as a mathematical expression that contains difficult-to-compute numbers. This requires the AI to implement a solution to logical uncertainty, but even if we already had such a solution, this would still place unrealistic demands on the human (namely, reducing the value alignment problem to a mathematical problem).
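As a toy example of "difficult-to-compute numbers" (with Monte Carlo sampling standing in for a real treatment of logical uncertainty, which it is not), the written-out utility might be scaled by a constant the AI can only estimate:

```python
import random
from math import gcd

# A stand-in for a difficult mathematical quantity: the density of
# coprime pairs of integers, which approaches 6/pi^2 ~ 0.608. The AI
# estimates it by sampling rather than computing it exactly.
def estimate_coprime_density(n=10**4, trials=10**5, seed=0):
    rng = random.Random(seed)
    hits = sum(
        gcd(rng.randint(1, n), rng.randint(1, n)) == 1 for _ in range(trials)
    )
    return hits / trials

c = estimate_coprime_density()

def human_utility(outcome_value):
    # The human's written-out utility references the hard constant c.
    return c * outcome_value
```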
Conclusion
This short overview of corrigibility models shows that simple uncertainty about the correct utility function is not sufficient for corrigibility. It is not clear what the correct solution to the hard problem of corrigibility is. Perhaps it will involve some model like those in this post in which humans are bounded in a specific way that causes them to desire corrigible AIs, or perhaps it will look completely different from these models.