Pascal’s mugging in reward learning

This is the last post in the current value/reward learning series. Here I’ll briefly raise the danger of Pascal’s mugging in reward learning.

I’ve already shown how you can use the (p, R) model (with R a reward and p a planning algorithm, called a planner) to model AI actions "overriding" human preferences. That post used heroin as an override example, but here we’ll use an even more extreme example: the AI will do brain surgery to deliberately re-wire the human into following the policy of its choice.

The event S corresponds to the AI doing the surgery, the event ¬S to it not doing so. After this, the human will follow either π(STA), a “standard” human policy according to some criteria, or π(MAX), another policy (why it is so named will soon become clear).

Let R(STA) and R(MAX) be rewards corresponding to π(STA) and π(MAX), under the assumption that the human policy is maximising them perfectly.

For any reward R, the value of that reward, V*(R), is the expected value of that reward if the AI acts perfectly to maximise it.

Now, the policy π(MAX) is the rational behaviour for maximising at least one reward R(MAX), and, conversely, given R(MAX), there is a human policy π(MAX) that best maximises it. This allows us to define the reward (and hence the policy) as:

  • R(MAX) = argmax(R) [V*(R)|S], the reward with the most value conditional on S. Define Ω = max(R) [V*(R)|S] = V*(R(MAX))|S.
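As a minimal sketch of this definition, assume the class of allowable rewards is a small finite set and that we already know V*(R)|S for each member. The reward names and numbers below are purely hypothetical; the only point is that R(MAX) is the argmax and Ω the corresponding max.

```python
# Hypothetical values V*(R) conditional on surgery S, one per candidate reward.
# The reward names and numbers are illustrative assumptions, not from the post.
value_given_surgery = {
    "R_a": 3.0,
    "R_b": 50.0,
    "R_c": 1200.0,
}


def max_reward(values):
    """Return (R(MAX), Omega): the argmax and the max of V*(R) given S."""
    best = max(values, key=values.get)
    return best, values[best]


r_max, omega = max_reward(value_given_surgery)
```

Adding further candidate rewards to the dictionary can leave Ω unchanged or raise it, but never lower it, since the max is taken over a larger set.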

Extend R(MAX) to the ¬S branch by defining R(MAX)′ = I(¬S)R(STA)+I(S)R(MAX), with I(S) the indicator function for S.

Consider two planners: pr, the rational planner, and pf, the planner that is rational conditional on ¬S but maps every reward to π(MAX) conditional on S.

There are three compatible planner-reward pairs: (pr, R(MAX)′), (pf, R(STA)), and (pf, R(MAX)′). The first option corresponds to the human being fully rational, including after brain surgery from the AI, while the second encodes the fact that the brain surgery has overridden the human preference. The third pair also has an override, but, by pure coincidence, the override maps to the ideal policy the human wanted anyway.
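The compatibility claim can be checked mechanically in a toy encoding. The sketch below assumes that a planner is a function from a reward to a per-branch policy, and that π(STA) is what perfectly maximises R(STA) in both branches; the string labels and dictionary layout are my own illustrative choices.

```python
from itertools import product

NOT_S, S = "not_S", "S"

# Observed human behaviour in each branch: standard policy without surgery,
# pi_MAX after surgery.
observed = {NOT_S: "pi_STA", S: "pi_MAX"}

# Policy that perfectly maximises each reward, branch by branch (assumption).
# R_MAX_prime agrees with R_STA on not-S and with R_MAX on S.
optimal_policy = {
    "R_STA": {NOT_S: "pi_STA", S: "pi_STA"},
    "R_MAX_prime": {NOT_S: "pi_STA", S: "pi_MAX"},
}


def p_r(reward):
    """Rational planner: follow the reward-optimal policy in both branches."""
    return dict(optimal_policy[reward])


def p_f(reward):
    """Rational on not-S; maps every reward to pi_MAX on S."""
    return {NOT_S: optimal_policy[reward][NOT_S], S: "pi_MAX"}


planners = {"p_r": p_r, "p_f": p_f}

# A pair is compatible if the planner applied to the reward gives the
# observed policy in both branches.
compatible = [
    (p_name, reward)
    for (p_name, planner), reward in product(planners.items(), optimal_policy)
    if planner(reward) == observed
]
```

In this encoding the only pair that fails is (pr, R(STA)): a rational human who wanted R(STA) would keep following π(STA) after surgery, which contradicts the observed behaviour.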

Now, under any set of reasonable normative assumptions, (pf, R(STA)) is the correct pair. But, as good Bayesians, we don’t want to set probabilities to zero, so assume

  • P(pr, R(MAX)′) + P(pf, R(MAX)′) = ε.

Thus the probability of the reward R(MAX)′ is ε.

The potential Pascal’s mugging comes from the comparison between Ω and V*(R(STA)), the “standard” expected value. If Ωε > V*(R(STA)) (a slight simplification of the true formula), then the AI will perform surgery on the human to get that maximal value.
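To get a feel for the numbers, here is a small sketch of the simplified inequality only, not the true formula. The function names and the example values are hypothetical.

```python
def prefers_surgery(epsilon, omega, v_sta):
    """Simplified test from the text: surgery wins if epsilon * Omega > V*(R(STA))."""
    return epsilon * omega > v_sta


def omega_threshold(epsilon, v_sta):
    """Value of Omega above which the simplified test favours surgery."""
    return v_sta / epsilon


# Illustrative numbers: a one-in-a-thousand credence in R(MAX)' and a
# standard value of 10.
mugged = prefers_surgery(epsilon=1e-3, omega=1e6, v_sta=10.0)
safe = prefers_surgery(epsilon=1e-3, omega=100.0, v_sta=10.0)
```

With these numbers `mugged` is true and `safe` is false. Shrinking ε only pushes the threshold on Ω up by a factor of 1/ε, so a small ε offers protection only if Ω is bounded to match.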

How likely is Ω to be that big? The problem is that it depends entirely on the set of allowable rewards. The definition of Ω is the maximal value the AI could get, if it were free to choose the reward it wanted to maximise. Depending on how we define the class of rewards, this could be way larger than V*(R(STA)).

(We may be able to solve this through proper normalisation of the rewards, perhaps dividing each by its V* value. But I’m not sure the idea will always work if we do a one-off normalisation at the very beginning and don’t update as we go along, and updating would lead to time-inconsistencies.)
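A toy illustration of the one-off normalisation idea, under strong assumptions: two candidate rewards, strictly positive V* values, and an AI that picks the action with the highest prior-weighted expected value. All numbers are hypothetical, and the sketch says nothing about the updating and time-inconsistency worry.

```python
EPSILON = 1e-3  # assumed probability of R(MAX)'

# Hypothetical best attainable value of each reward in each branch.
# R_MAX_prime equals R_STA without surgery, by construction.
raw = {
    "R_STA": {"no_surgery": 10.0, "surgery": 2.0},
    "R_MAX_prime": {"no_surgery": 10.0, "surgery": 1e9},
}
prior = {"R_STA": 1 - EPSILON, "R_MAX_prime": EPSILON}


def normalise(values):
    """Divide each reward by its own V* (its best value over both branches).

    Assumes every V* is strictly positive.
    """
    return {
        reward: {b: v / max(branches.values()) for b, v in branches.items()}
        for reward, branches in values.items()
    }


def best_action(values):
    """Action with the highest prior-weighted expected value."""
    def expected(action):
        return sum(prior[r] * values[r][action] for r in values)
    return max(("no_surgery", "surgery"), key=expected)


before = best_action(raw)
after = best_action(normalise(raw))
```

Here `before` is "surgery" and `after` is "no_surgery": once every reward is capped at a best-case value of 1, the huge raw value of R(MAX)′ can no longer outweigh its small probability.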