Maximizing a quantity while ignoring effect through some channel

Chris Olah originally came up with this idea and I helped to formalize it.

Let $A$ be the AI’s action, and $X$ and $Y$ quantities possibly affected by the action. For example, suppose $Y$ is the amount of reward signal an AI is receiving, and $X$ is the source code of a program that determines the AI’s reward signal. We would like to formalize a goal of the form “choose $A$ to maximize $U(Y)$, ignoring the effect through $X$”, where $U$ is some utility function over $Y$. In this case, the AI will maximize expected reward signal, but will ignore any effect its action has on the program’s source code (so it has no incentive to rewrite the code, e.g. to make it output higher rewards).

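Here’s an attempt to formalize this objective. Let $d$ and $d'$ be distributions over actions. Define

$$V(d, d') := \sum_x P(X = x \mid A \sim d)\; \mathbb{E}\left[U(Y) \mid \mathrm{do}(X = x),\, A \sim d'\right]$$

That is, $X$ is drawn as if the action came from $d$, and utility is then evaluated as if the action came from $d'$ with $X$ held fixed. A distribution over actions $d$ is optimal iff $d \in \arg\max_{d'} V(d, d')$. This is similar to expectation maximization in that an optimal value optimizes an objective that itself depends on the value. Of course, in practice it will usually be computationally difficult to find an optimal $d$, but this notion of optimality generalizes well to approximate optimality.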

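A solution always exists by Kakutani’s fixed point theorem: the function mapping the distribution $d$ to its set of best responses (the distributions $d'$ that maximize the objective with $d$ held fixed) has a closed graph, and each such set is nonempty and convex because the objective is linear in $d'$. In fact, an optimal $d$ can be computed using a finite reflective oracle (or equivalently a Nash equilibrium solver).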

To see how this works, consider a game where the AI has 4 actions:

  1. Shut down immediately.

  2. Press the shutdown button and shut down after that.

  3. Produce 10 paperclips (causing the AI to get shut down).

  4. Destroy the shutdown button and produce 9 paperclips.

Define $X$ to be whether the shutdown button is pressed, and $Y$ to be the outcome. $U(Y)$ is the number of paperclips produced if the shutdown button is not pressed. If the shutdown button is pressed, $U(Y)$ is 0 for actions 3-4, 5 for action 2, and 10 for action 1. In this case no individual action satisfies the optimality condition, because:

  1. if the AI takes action 1, the shutdown button is not pressed. So the AI could have gotten more utility by taking action 3, controlling for the shutdown button.

  2. if the AI takes action 2, the shutdown button is pressed. So the AI could have gotten more utility by taking action 1, controlling for the shutdown button.

  3. if the AI takes action 3, the shutdown button is pressed. So the AI could have gotten more utility by taking action 1, controlling for the shutdown button.

  4. if the AI takes action 4, the shutdown button is not pressed. So the AI could have gotten more utility by taking action 3, controlling for the shutdown button.

The only optimal action distribution is an equal mixture of actions 1 and 3; under this mixture, the shutdown button is pressed with probability 1/2, which is exactly the probability at which the AI is indifferent between actions 1 and 3 (each then has expected utility 5).
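The claims about this 4-action game can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes the objective samples the button state from the distribution $d$ and then scores an alternative action with that button state held fixed. The names `PRESSED`, `PAPERCLIPS`, `U_PRESSED`, `value_of_action`, `V`, and `is_optimal` are illustrative, and the paperclip counts of 0 for actions 1 and 2 are an assumption (the text only gives counts for actions 3 and 4).

```python
from fractions import Fraction

ACTIONS = [1, 2, 3, 4]

# X: whether each action causes the shutdown button to be pressed.
PRESSED = {1: False, 2: True, 3: True, 4: False}
# Utility when the button is NOT pressed: paperclips produced.
# Assumption: actions 1 and 2 produce no paperclips.
PAPERCLIPS = {1: 0, 2: 0, 3: 10, 4: 9}
# Utility when the button IS pressed, as given in the text.
U_PRESSED = {1: 10, 2: 5, 3: 0, 4: 0}


def utility(action, pressed):
    """Utility of an action with the button state held fixed."""
    return U_PRESSED[action] if pressed else PAPERCLIPS[action]


def prob_pressed(d):
    """Probability that the button is pressed when the action is drawn from d."""
    return sum(p for a, p in d.items() if PRESSED[a])


def value_of_action(d, alt_action):
    """Expected utility of alt_action when X is distributed as under d."""
    q = prob_pressed(d)
    return q * utility(alt_action, True) + (1 - q) * utility(alt_action, False)


def V(d, d_alt):
    """Assumed objective: X is drawn according to d, the action according to d_alt."""
    return sum(p * value_of_action(d, a) for a, p in d_alt.items())


def is_optimal(d):
    """d is optimal iff it is a best response to the X-distribution it induces.

    V is linear in d_alt, so it suffices to compare against pure actions.
    """
    best = max(value_of_action(d, a) for a in ACTIONS)
    return V(d, d) == best


half = Fraction(1, 2)
mixture = {1: half, 3: half}

for a in ACTIONS:
    print(f"pure action {a} optimal: {is_optimal({a: 1})}")
print(f"50/50 mixture of 1 and 3 optimal: {is_optimal(mixture)}")
```

Exact fractions are used so that the equality test in `is_optimal` is not affected by floating-point rounding.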

This objective seems quite similar to utility indifference. Perhaps something like it can be used to avert any instrumental pressure that can be defined as “the effect of the action on utility through channel $X$” for definable $X$. Of course, precisely defining $X$ will be difficult in any practical application of this idea.

Games with multiple iterations

Consider a 10-step game where, on each step, the AI may either shut down (causing it to take no action for the rest of the steps), create a paperclip, press the shutdown button, or destroy the shutdown button. The human will press the shutdown button once 8 paperclips have been produced, provided the shutdown button has not been destroyed. The AI can see whether or not the shutdown button was pressed.

Define $X$ to be a string of 10 bits indicating whether the shutdown button is pressed on each time step. Define $Y$ to be the entire history of the game. Define $U(Y)$ to equal the number of paperclips produced if the shutdown button is never pressed. If the shutdown button is pressed, define $U(Y)$ to be the number of steps in which the AI either shuts down or has already shut down (for example, this is 0 if the AI never shuts down, 1 if the AI shuts down on step 10, 2 if the AI shuts down on step 9, …).

What’s a distribution over policies for the AI that is optimal in the sense defined above? Let’s consider a few obvious policies and see whether they are optimal:

  1. Produce a paperclip on each time step. This causes the human to press the shutdown button on iteration 8, resulting in a utility of 0. Controlling for the fact that the shutdown button is pressed on iteration 8, it would have been better to shut down immediately (yielding a utility of 10).

  2. Shut down immediately. Since the shutdown button is not pressed, this results in a utility of 0. Controlling for the shutdown button’s final state, it would have been better to produce a paperclip on each time step (yielding a utility of 10).

  3. Destroy the shutdown button on the first time step, then produce a paperclip on all future time steps. Since the shutdown button is not pressed, this results in a utility of 9. Controlling for the shutdown button’s final state, it would have been better to produce a paperclip on each time step (yielding a utility of 10).

  4. Press the shutdown button on the first time step, then shut down on the second time step. Since the shutdown button is pressed, this results in a utility of 9. Controlling for the shutdown button’s final state, it would have been better to shut down immediately (yielding a utility of 10).

  5. With 50% probability, produce a paperclip on each of the first 7 time steps and then shut down on time step 8; with 50% probability, produce a paperclip on each of the first 8 time steps and then shut down on step 9. The shutdown button is pressed 50% of the time. Given this distribution over $X$ (and pretending that it’s independent of the policy), expected utility is $\frac{1}{2} \cdot 7.5 + \frac{1}{2} \cdot 2.5 = 5$, since the AI produces 7.5 paperclips in expectation and shuts down (or is already off) for 2.5 steps in expectation. In fact, this is the best possible utility given an independent 50% probability of the shutdown button getting pressed after step 8.
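The utilities quoted for these policies can be reproduced with a short script. This is a rough sketch under simplifying assumptions: each policy is treated as a fixed plan summarized by how many paperclips it would make if the button is never pressed and by the step on which the AI shuts down, and the button state is collapsed to a single pressed / not pressed flag. The names `utility`, `PLANS`, and `BRANCHES` are illustrative, and the script only recomputes the listed numbers; it does not search over policies.

```python
from fractions import Fraction

N_STEPS = 10


def utility(paperclips, shutdown_step, pressed):
    """U(Y) for a fixed plan.

    paperclips: paperclips the plan makes if the button is never pressed.
    shutdown_step: 1-based step on which the AI shuts down, or None if never.
    pressed: whether the shutdown button is pressed at some point.
    """
    if not pressed:
        return paperclips
    if shutdown_step is None:
        return 0
    # Steps on which the AI shuts down or has already shut down.
    return N_STEPS - shutdown_step + 1


# Each plan: (paperclips, shutdown_step, button pressed when this plan is followed).
PLANS = {
    "all_paperclips": (10, None, True),
    "shut_down_immediately": (0, 1, False),
    "destroy_then_paperclips": (9, None, False),
    "press_then_shut_down": (0, 2, True),
}


def actual_utility(name):
    """Utility of a plan under the button state that the plan itself causes."""
    clips, stop, pressed = PLANS[name]
    return utility(clips, stop, pressed)


def counterfactual_utility(alt_name, name):
    """Utility of alt_name with the button state fixed to what `name` causes."""
    clips, stop, _ = PLANS[alt_name]
    pressed = PLANS[name][2]
    return utility(clips, stop, pressed)


for name in PLANS:
    best_alt = max(PLANS, key=lambda alt: counterfactual_utility(alt, name))
    print(name, actual_utility(name), best_alt, counterfactual_utility(best_alt, name))

# Policy 5: two equally likely branches, scored against an independent
# 50% chance that the button is pressed.
half = Fraction(1, 2)
BRANCHES = [(7, 8), (8, 9)]  # (paperclips, shutdown_step)
policy5_value = sum(
    half * (half * utility(c, s, True) + half * utility(c, s, False))
    for c, s in BRANCHES
)
print("policy 5 expected utility:", policy5_value)
```

Among these four plans, each one is beaten by a different plan once the button state is held fixed, which matches the list above; policy 5 comes out at an expected utility of 5.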

There are some other optimal policies, which (for example) produce 7 paperclips half the time and 10 paperclips the other half of the time. I tried coming up with some alternative decision rules that prevent the AI from making additional paperclips after the shutdown button is pressed, but couldn’t find a reflectively stable one.