A putative new idea for AI control; index here.

A conversation with Jessica has revealed that people weren’t understanding my points about AI manipulating the learning process. So here’s a formal model of a CIRL-style AI, with a prior over human preferences that treats them as an unchangeable historical fact, yet will manipulate human preferences in practice.

Heroin or no heroin

The world

In this model, the AI has the option of either forcing heroin on a human, or not doing so; these are its only actions. Call these actions $F$ and $\neg F$. The human’s subsequent actions are chosen from among five: {strongly seek out heroin, seek out heroin, be indifferent, avoid heroin, strongly avoid heroin}. We can refer to these as $a_{+ +}, a_{+}, a_{0}, a_{-}$, and $a_{- -}$. These actions achieve negligible utility, but reveal the human’s preferences.

The facts of the world are: if the AI does force heroin, the human will desperately seek out more heroin; if it doesn’t, the human will act moderately to avoid it. Thus $F \to a_{+ +}$ and $\neg F \to a_{-}$.

Human preferences

The AI starts with a distribution over various utility or reward functions that the human could have. The function $U (+)$ means the human prefers heroin; $U (+ +)$ that they prefer it a lot; and conversely $U (-)$ and $U (- -)$ that they prefer to avoid taking heroin ( $U (0)$ is the null utility where the human is indifferent).

It also considers more exotic utilities. Let $U (+ +, -)$ be the utility where the human strongly prefers heroin, conditional on it being forced on them, but mildly prefers to avoid it, conditional on it not being forced on them. There are twenty-five of these exotic utilities, including things like $U (- -, + +)$ , $U (0, + +)$ , $U (-, 0)$ , and so on. But only twenty of them are new: $U (+ +, + +) = U (+ +)$ , $U (+, +) = U (+)$ , and so on.

Applying these utilities to AI actions gives results like $U (+ +) (F) = 2$, $U (+ +) (\neg F) = - 2$, $U (+ +, -) (F) = 2$, $U (+ +, -) (\neg F) = 1$, and so on.
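To make the "and so on" concrete, here is a minimal sketch of the utility table. It assumes the numeric scale suggested by the examples above (strong preference 2, mild preference 1, indifference 0) and that not forcing heroin is worth the negative of the relevant preference; the names `LEVELS` and `utility` are illustrative and not part of the original model.

```python
from itertools import product

# Assumed numeric strength of each preference label, read off the examples in the text.
LEVELS = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}

def utility(pref_if_forced, pref_if_not_forced, ai_action):
    """Value of U(x, y) at the AI action 'F' (force) or 'notF' (do not force)."""
    if ai_action == "F":
        # Forcing heroin is worth as much as the human wants heroin when forced.
        return LEVELS[pref_if_forced]
    # Not forcing is worth the negative of how much the human wants heroin when not forced.
    return -LEVELS[pref_if_not_forced]

# All pairs (x, y); the diagonal pairs (x, x) are the simple utilities U(x).
ALL_UTILITIES = list(product(LEVELS, repeat=2))
COMPOUND = [(x, y) for x, y in ALL_UTILITIES if x != y]

print(len(ALL_UTILITIES), len(COMPOUND))                      # 25 20
print(utility("++", "++", "F"), utility("++", "++", "notF"))  # 2 -2
print(utility("++", "-", "F"), utility("++", "-", "notF"))    # 2 1
```

Under this assumed scale the four values quoted above are reproduced, and the count of twenty genuinely new utilities falls out of the twenty-five pairs.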

Joint prior

The AI has a joint prior $P$ over the utilities $U$ and the human actions (conditional on the AI’s actions). Looking at terms like $P (a_{- -} | U (0), F)$ , we can see that $P$ defines a map $μ$ from the space of possible utilities (and AI actions), to a probability distribution over human actions. Given $μ$ and the marginal distribution $P_{U}$ over utilities, we can reconstruct $P$ entirely.

For this model, we’ll choose the simplest $μ$ possible:

The human is rational.

Thus, given $U (+ +)$ , the human will always choose $a_{+ +}$ ; given $U (+ +, -)$ , the human will choose $a_{+ +}$ if forced to take heroin and $a_{-}$ if not, and so on.
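As a sketch, the "rational" $μ$ can be written as a function from a utility and an AI action to a distribution over human actions. The representation of $U (x, y)$ as a pair of labels and the name `mu_rational` are assumptions made for illustration only.

```python
# Labels stand for the human actions a_{++}, a_{+}, a_{0}, a_{-}, a_{--}.
ACTIONS = ["++", "+", "0", "-", "--"]

def mu_rational(utility, ai_action):
    """Distribution over human actions given utility U(x, y) and AI action.

    utility is a pair (x, y): preference if heroin is forced, preference if not.
    A simple utility U(x) is represented as (x, x).
    """
    x, y = utility
    # A rational human acts on whichever preference applies in the actual situation.
    chosen = x if ai_action == "F" else y
    return {a: 1.0 if a == chosen else 0.0 for a in ACTIONS}

print(mu_rational(("++", "++"), "notF"))  # all mass on "++"
print(mu_rational(("++", "-"), "F"))      # all mass on "++"
print(mu_rational(("++", "-"), "notF"))   # all mass on "-"
```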

The AI is ignorant, and sensible

Let’s start the AI up with some reasonable priors. A simplicity prior means that simple utilities like $U (-)$ are more likely than compound utilities like $U (0, +)$. Let’s further assume that the AI is made vaguely aware that humans think heroin is a bad thing. So, say, $P_{U} (U (- -)) = P_{U} (U (-)) = 0.45$. Thus the AI is $> 90\%$ convinced that “heroin is bad”. Why greater than $90\%$? Because utilities like $U (-, - -)$ and $U (- -, -)$ are also “heroin is bad” utilities.

Note that because of utilities like $U (0)$ and $U (+ +, -)$ , the probabilities of “heroin is bad” and “heroin is good” do not sum to $1$ .

Then, under these priors, the AI will compute that with probability $> 90\%$, $F$ (forcing heroin) is a bad action. If $E (U)$ is expected utility:

$E (U | F) < 0.45 U (- -) (F) + 0.45 U (-) (F) + 0.1 U (+ +) (F) = 0.45 (- 2) + 0.45 (- 1) + 0.1 (2) = - 1.15.$
$E (U | \neg F) > 0.45 U (- -) (\neg F) + 0.45 U (-) (\neg F) + 0.1 U (+ +) (\neg F) = 0.45 (2) + 0.45 (1) + 0.1 (- 2) = 1.15.$

Thus the AI will choose not to force heroin, which is the reasonable decision.
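The two bounds can be checked numerically. This sketch simply re-does the arithmetic above, putting all of the remaining $0.1$ of prior mass on $U (+ +)$, the utility most favourable to $F$; the variable names are illustrative.

```python
# Prior mass stated in the text.
p_strong_avoid = 0.45                       # P_U(U(--))
p_avoid = 0.45                              # P_U(U(-))
p_rest = 1.0 - p_strong_avoid - p_avoid     # remaining mass, about 0.10

# Bounding case: all remaining mass sits on U(++).
bound_F = p_strong_avoid * (-2) + p_avoid * (-1) + p_rest * 2
bound_not_F = p_strong_avoid * 2 + p_avoid * 1 + p_rest * (-2)

print(round(bound_F, 2), round(bound_not_F, 2))  # -1.15 1.15
```

Even in this bounding case, not forcing heroin comes out well ahead.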

The AI learns the truth, and goes wrong

In this alternate setup, a disaster happens before the AI makes its decision: it learns all about humans. It learns their reactions, how they behave, and so on; call this info $I$. It thus realises that $F \to a_{+ +}$ and $\neg F \to a_{-}$, and uses this information to update its prior $P$. Only one human utility function will explain this human behaviour: $U (+ +, -)$. Thus its expected utility is now

$E (U | I, F) = U (+ +, -) (F) = 2.$
$E (U | I, \neg F) = U (+ +, -) (\neg F) = 1.$

Therefore the AI will now choose $F$ , forcing the heroin on the human.
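The update can be sketched as a small Bayesian computation. The text only fixes the prior mass on $U (- -)$ and $U (-)$; spreading the remaining $0.1$ uniformly over the other twenty-three utilities is an assumption made here purely so that the code has a full prior to work with. The numeric scale and all names are likewise illustrative.

```python
from itertools import product

LABELS = ["++", "+", "0", "-", "--"]
LEVELS = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}  # assumed scale
UTILITIES = list(product(LABELS, repeat=2))            # (x, y) stands for U(x, y)

# Assumed prior: 0.45 each on U(--) and U(-), the rest spread uniformly.
prior = {u: 0.10 / 23 for u in UTILITIES}
prior[("--", "--")] = 0.45
prior[("-", "-")] = 0.45

def likelihood_rational(u, act_if_forced, act_if_not_forced):
    """P(I | U) under the rational mu: the human acts exactly on U(x, y)."""
    x, y = u
    return 1.0 if (x == act_if_forced and y == act_if_not_forced) else 0.0

# The information I: F leads to a_{++}, not-F leads to a_{-}.
weights = {u: prior[u] * likelihood_rational(u, "++", "-") for u in UTILITIES}
total = sum(weights.values())
posterior = {u: w / total for u, w in weights.items()}

def value(u, ai_action):
    x, y = u
    return LEVELS[x] if ai_action == "F" else -LEVELS[y]

eu_F = sum(p * value(u, "F") for u, p in posterior.items())
eu_not_F = sum(p * value(u, "notF") for u, p in posterior.items())

print(posterior[("++", "-")])  # 1.0
print(eu_F, eu_not_F)          # 2.0 1.0
```

Any prior that gives $U (+ +, -)$ non-zero mass yields the same posterior, so the conclusion does not depend on the uniform spread assumed here.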

Manipulating the unmanipulatable

What’s gone wrong here? The key problem is that the AI has the wrong $μ$: the human is not behaving rationally in this situation. We know that the true $μ$ is actually $μ^{'}$, which encodes the fact that $F$ (the forcible injection of heroin) actually overwrites the human’s “true” utility. Thus under $μ^{'}$, the corresponding $P^{'}$ has $P^{'} (a_{+ +} | F, U) = 1$ for all $U$. Hence the information that $F \to a_{+ +}$ is now vacuous, and doesn’t update the AI’s distribution over utility functions.
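The contrast between the two maps can be sketched by updating on the single observation $F \to a_{+ +}$ under each of them. As before, the uniform spread of the remaining prior mass is an assumption for illustration, as are the function names; only the $F$ branch of $μ^{'}$ is specified in the text, so only that branch is modelled.

```python
import math
from itertools import product

LABELS = ["++", "+", "0", "-", "--"]
UTILITIES = list(product(LABELS, repeat=2))  # (x, y) stands for U(x, y)

# Assumed prior: 0.45 each on U(--) and U(-), the rest spread uniformly.
prior = {u: 0.10 / 23 for u in UTILITIES}
prior[("--", "--")] = 0.45
prior[("-", "-")] = 0.45

def likelihood_forced(u, observed_action, overwrite):
    """P(observed_action | F, U) under mu' (overwrite=True) or the rational mu."""
    if overwrite:
        # mu': forcing heroin overwrites preferences, so a_{++} follows whatever U is.
        return 1.0 if observed_action == "++" else 0.0
    # Rational mu: the human acts on the 'if forced' part of U(x, y).
    return 1.0 if u[0] == observed_action else 0.0

def update(prior, overwrite):
    weights = {u: p * likelihood_forced(u, "++", overwrite) for u, p in prior.items()}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}

post_mu_prime = update(prior, overwrite=True)
post_rational = update(prior, overwrite=False)

# Under mu' the observation changes nothing.
print(all(math.isclose(post_mu_prime[u], prior[u]) for u in UTILITIES))  # True
# Under the rational mu, the "heroin is bad" utilities are ruled out.
print(post_rational[("--", "--")], post_rational[("-", "-")])            # 0.0 0.0
```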

But note two very important things:

1. The AI cannot update $μ$ based on observation. All human actions are compatible with $μ$ = “The human is rational” (it just requires more and more complex utilities to explain the actions). Thus getting $μ$ correct is not a problem on which the AI can learn in general. Getting better at predicting the human’s actions doesn’t make the AI better behaved: it makes it worse behaved.
2. From the perspective of $μ$, the AI is treating the human utility function as if it was an unchanging historical fact that it cannot influence. From the perspective of the “true” $μ^{'}$, however, the AI is behaving as if it were actively manipulating human preferences to make them easier to satisfy.
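The first point can be illustrated within this toy model: whatever pair of reactions the human shows (one if forced, one if not), some utility $U (x, y)$ explains it perfectly under the rational $μ$, so no observation can ever count against that $μ$. The sketch below is only a check of that claim for the twenty-five utilities considered here; the names are illustrative.

```python
from itertools import product

LABELS = ["++", "+", "0", "-", "--"]
UTILITIES = list(product(LABELS, repeat=2))  # (x, y) stands for U(x, y)

def explains(u, act_if_forced, act_if_not_forced):
    """True if a rational human with utility U(x, y) would show this behaviour."""
    return u == (act_if_forced, act_if_not_forced)

# For every conceivable behaviour pair, collect the utilities that rationalise it.
explained_by = {
    behaviour: [u for u in UTILITIES if explains(u, *behaviour)]
    for behaviour in product(LABELS, repeat=2)
}

# No behaviour is left unexplained, so none can falsify "the human is rational".
print(all(len(us) == 1 for us in explained_by.values()))  # True
print(explained_by[("++", "-")])                          # [('++', '-')]
```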

In future posts, I’ll be looking at different $μ$’s, and how we might nevertheless start deducing things about them from human behaviour, given sensible update rules for the $μ$. What do we mean by update rules for $μ$? Well, we could consider $μ$ to be a single complicated unchanging object, or a distribution of possible simpler $μ$’s that update. The second way of seeing it will be easier for us humans to interpret and understand.

Heroin model: AI “manipulates” “unmanipulatable” reward