Minimax as an approach to reduced-impact AI
Summary: if we’re not sure what the right utility function is, we might use the minimax decision rule to create a low-impact AI that is not bad for any possible utility function. There are also some flawed ways to combine minimax with value learning and corrigibility that might be worth improving on. This is a writeup of one idea from my visit to MIRI last week, during which I worked mostly with Benja and, to a lesser extent, with Nate.
The minimax decision rule
The minimax decision rule maximizes the minimum expected utility under some set of uncertain parameters. The uncertainty that a minimaxer has over these parameters can be considered a form of Knightian uncertainty.
For reduced-impact AI it will be useful to consider the utility function itself as an unknown parameter. Suppose we have some set V of utility function representatives. A utility function representative is a function mapping each outcome to a real number, defining a VNM preference relation; notably, a single VNM preference relation can correspond with multiple utility function representatives, which are translations and scalings of each other. Now we could initially define the minimax rule as:
$$\operatorname{argmax}_{\pi \in \Pi} \; \min_{v \in V} \; \mathbb{E}[v(X) \mid O; \pi]$$
where π is the agent’s policy (contained in the set Π), O consists of some universe-locating observations programmed into the AI, and X is the outcome. O is meant to locate our universe in a way that prevents the minimaxer from making decisions from “behind the veil” (which might cause it to, say, optimize v1 at the expense of v2 in our universe and vice versa in another universe). Minimaxers are not VNM in general due to failing the axiom of independence.
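As a concrete sketch, the rule can be computed directly when the policy set and the set of representatives are finite. Everything in this snippet is hypothetical: the policy labels and the expected-utility table stand in for E[v(X)|O;π].

```python
def minimax_policy(policies, reps, expected_utility):
    """Return the policy maximizing the minimum expected utility over reps.

    expected_utility[(pi, v)] plays the role of E[v(X) | O; pi].
    """
    return max(policies,
               key=lambda pi: min(expected_utility[(pi, v)] for v in reps))

# Hypothetical numbers: "act" helps v1 but hurts v2; "shutdown" is a
# zero-baseline status quo for every representative.
policies = ["act", "shutdown"]
reps = ["v1", "v2"]
expected_utility = {
    ("act", "v1"): 2.0, ("act", "v2"): -1.0,
    ("shutdown", "v1"): 0.0, ("shutdown", "v2"): 0.0,
}
print(minimax_policy(policies, reps, expected_utility))  # -> shutdown
```

Here the minimax rule refuses to act, since acting makes some representative worse off than the zero baseline.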
Why might we consider programming a reduced-impact AI to use minimax? If we are uncertain about the correct utility function, but know that it is in some set, then we could define V to contain a representative for each utility function. To reduce impact, we would like each representative to assign 0 utility to the expected status quo given O (i.e. $\forall v \in V: \mathbb{E}[v(X) \mid O; \pi = \text{shutdown}] = 0$). This way, the AI will only take actions if no utility function in our set loses out relative to the status quo (conditioned on O). This is similar to Stuart Armstrong’s satisficer design. With such a design, the AI might be unable to take any actions other than shutting down (for example, if for some v both v and −v appear in V), but at least it will not reduce any utility in expectation given O (compared to when it shuts down immediately).
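This zero-normalization amounts to shifting each representative by its status-quo expectation. A minimal sketch, assuming a known (hypothetical) outcome distribution under the shutdown policy:

```python
# Hypothetical P(X | O; shutdown) and a raw utility representative.
shutdown_dist = {"x_status_quo": 0.9, "x_mild_drift": 0.1}
v_raw = {"x_status_quo": 4.0, "x_mild_drift": 2.0, "x_big_change": 10.0}

# E[v(X) | O; shutdown] under the raw representative.
baseline = sum(p * v_raw[x] for x, p in shutdown_dist.items())

# Shift v so that the status quo has expected utility exactly 0.
v_normalized = {x: val - baseline for x, val in v_raw.items()}

print(round(sum(p * v_normalized[x] for x, p in shutdown_dist.items()), 9))
```

Since a translation of a representative induces the same VNM preference relation, this shift changes nothing about the underlying preferences; it only fixes the baseline against which impact is measured.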
Note that since these expected utility guarantees talk about expected utilities given O, we will want the AI to have all the information we have in order to be confident that, from our perspective, expected utility does not decrease. It is not enough for the creation of the minimaxer to be good a priori: it must also be good given the information we already know about the world. The fact that it is dangerous not to give the system enough observations is worrying because (1) we don’t expect to be able to write down all the information in our memory, and (2) it might also indicate that imperfect inference algorithms would make the minimaxer dangerous. Also note that evaluating the expected utility of the status quo relies on hard-to-calculate logical information (such as the probability that people create friendly AI), and it is not clear what happens when we don’t have logical omniscience.
Combining minimax with value learning
Often, we don’t just want to minimax over possible utility functions; we also want to learn the utility function. These goals are not necessarily opposed. If we are in a value learning framework, then we believe that the distribution of X as a function of π depends on the correct utility function representative v, so we should rewrite the rule as:
$$\operatorname{argmax}_{\pi \in \Pi} \; \min_{v \in V} \; \mathbb{E}[v(X) \mid O; \pi, v]$$
Note that we are not concerned with the prior distribution over v, only the distribution over outcomes and observations as a function of π and v. This resembles frequentism. In fact, we can use frequentist statistics to select minimax policies (similar to minimax estimators).
Consider the following toy example. There are two utility functions, v1 and v2. Assume that there are no universe-locating observations. We believe that our next observation O′ (which may be either o1 or o2) depends on the utility function. Specifically, $P(O' = o_1; v_1) = P(O' = o_2; v_2) = 0.75$. After seeing the observation, we may choose the outcome x1, x2, or x3. Utilities are as follows:
$$v_1(x_1) = v_2(x_2) = 5, \qquad v_1(x_2) = v_2(x_1) = 0, \qquad v_1(x_3) = v_2(x_3) = 3$$
So x1 is good for v1, x2 is good for v2, and x3 is a compromise. If we could not see any observations, then the minimax rule would select x3. However, consider an alternative policy πl: if we see o1, select x1, and if we see o2, select x2. Now we have $\mathbb{E}[v_1(X); \pi_l, v_1] = \mathbb{E}[v_2(X); \pi_l, v_2] = 0.75 \cdot 5 = 3.75 > 3$. Therefore, minimax will prefer this policy to the one that always selects x3.
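The toy example can be checked by brute force: enumerate every deterministic observation-contingent policy and pick the one with the best worst case. The distributions and utilities are exactly those given above; the data-structure choices are mine.

```python
from itertools import product

# P(O' | v) and the utility table from the toy example.
P = {"v1": {"o1": 0.75, "o2": 0.25}, "v2": {"o1": 0.25, "o2": 0.75}}
U = {"v1": {"x1": 5, "x2": 0, "x3": 3},
     "v2": {"x1": 0, "x2": 5, "x3": 3}}
outcomes = ["x1", "x2", "x3"]

def expected_utility(policy, v):
    # policy maps each observation to an outcome; the expectation over
    # observations uses the distribution indexed by v.
    return sum(P[v][o] * U[v][policy[o]] for o in ("o1", "o2"))

# All 9 deterministic policies of the form "see o1 -> a, see o2 -> b".
policies = [{"o1": a, "o2": b} for a, b in product(outcomes, repeat=2)]
best = max(policies,
           key=lambda pi: min(expected_utility(pi, v) for v in ("v1", "v2")))
print(best)                                                  # {'o1': 'x1', 'o2': 'x2'}
print(min(expected_utility(best, v) for v in ("v1", "v2")))  # 3.75
```

The brute-force search recovers πl, with worst-case expected utility 3.75, beating the observation-ignoring compromise policy's 3.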
What this tells us is that, if we know what distribution over possible worlds (including counterfactuals) we expect for each utility function, we might be able to use minimax to conservatively learn the utility function without committing to any prior over utility functions. In fact, we could even have Knightian uncertainty over the distribution over possible worlds to expect for each utility function.
However, since we’re also conditioning on O, we can’t learn anything about the utility function that O screens off. If we want O to contain quite a lot of information, then this is a serious problem: we won’t actually learn anything new about v from future observations, so we do not take into account the fact that the outcome distribution depends on v. I currently don’t see a clear way to resolve this that doesn’t relax the safety guarantees we get from conditioning on O.
Combining minimax with corrigibility
When creating a corrigible system, we want an AI to try to maximize some objective while behaving relatively conservatively and responding to shutdown signals (because it might be maximizing the wrong objective). As in Stuart Armstrong’s satisficer design, we can add an ϵu term to make the system try to conservatively optimize u. Specifically, we choose our policy as follows:
$$\operatorname{argmax}_{\pi \in \Pi} \left( \epsilon \, \mathbb{E}[u(X) \mid O; \pi, u] + \min_{v \in V} \mathbb{E}[v(X) \mid O; \pi, v] \right)$$
This policy will avoid reducing any v∈V too much, while trying to optimize u. If (say) the user pressing a shutdown button is less likely given that u is the correct utility function, then the optimal policy might optimize u more in worlds where the shutdown button is not pressed and less in worlds where it is pressed (i.e. it will behave more conservatively when the shutdown button is pressed). Of course, this assumes we already know that the shutdown button counts as evidence that we have the wrong utility function, and in a way not screened off by O.
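A sketch of the combined objective, again over finite hypothetical tables: eu_u stands in for E[u(X)|O;π,u] and eu_v for E[v(X)|O;π,v]; the policy labels and the value of ϵ are made up for illustration.

```python
def combined_score(pi, eps, eu_u, eu_v):
    """eps * E[u | pi, u] plus the minimax term over V."""
    reps = {v for (p, v) in eu_v if p == pi}
    return eps * eu_u[pi] + min(eu_v[(pi, v)] for v in reps)

policies = ["cautious", "aggressive", "shutdown"]
eu_u = {"cautious": 1.0, "aggressive": 3.0, "shutdown": 0.0}
eu_v = {
    ("cautious", "v1"): 0.2, ("cautious", "v2"): 0.1,
    ("aggressive", "v1"): 1.0, ("aggressive", "v2"): -2.0,
    ("shutdown", "v1"): 0.0, ("shutdown", "v2"): 0.0,
}
eps = 0.1
best = max(policies, key=lambda pi: combined_score(pi, eps, eu_u, eu_v))
print(best)  # -> cautious
```

With these numbers the aggressive policy's gain on u cannot pay for the damage it does to v2, so the cautious policy wins; raising ϵ enough would flip this, which is exactly why the expected-decrease bounds on each v loosen as ϵ grows.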
Unfortunately, we are not actually guaranteed that the AI will immediately shut down. Perhaps it reduced some v values in the course of trying to optimize u before learning that u was the wrong utility function. Then the next action (upon finding that u is wrong) might be to increase some of the lowest v values. This could be quite dangerous! Although we can place bounds on how much any v can decrease in expectation (which get tighter the lower ϵ is), we might not have enough confidence in the system to trust it to make these calculations instead of just shutting down.