Stop buttons and causal graphs

Edit: This work is known to be obsolete. The design will not make correct side bets on whether the button will be pressed.

Suppose that all problems relating to epistemic rationality have been solved. We create an AI with an epistemic module and an instrumental module. The epistemic module produces a causal graph like this one. (Well, a probability distribution over such graphs, but suppose for now that only one graph has significant probability.)



Here blue rectangles are the actions the AI can take in the world, and red ellipses represent real-world events. The green diamond represents utility. Mathematically, these nodes are stochastic functions that take several random variables as inputs and return an output random variable. The rectangles are action nodes, so they just represent a function of the AI's action. Each node is the situation at a single instant; if the AI has multiple opportunities like this one at successive times, the graph needs to be repeated. The AI's instrumental rationality can then work by searching over the available actions and picking whichever one gives the highest expected utility according to this graph.
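As a concrete illustration, here is a minimal sketch of that setup in Python. The node names and probabilities are invented for this example; the point is just that each node is a function of its parents, and the instrumental module takes an argmax of expected utility over the action node.

```python
# A toy causal graph: each node is a (possibly stochastic) function of its
# parents, and the instrumental module picks the action with the highest
# expected utility. All names and numbers here are made up for illustration.
import random

rng = random.Random(0)
ACTIONS = ["do_nothing", "fill_cauldron"]

def weather():                        # exogenous event node (red ellipse)
    return "rain" if rng.random() < 0.1 else "clear"

def cauldron_full(action, weather):   # event node: depends on the action and the weather
    return action == "fill_cauldron" and weather == "clear"

def utility(full):                    # utility node (green diamond)
    return 1.0 if full else 0.0

def expected_utility(action, samples=10_000):
    return sum(utility(cauldron_full(action, weather())) for _ in range(samples)) / samples

print(max(ACTIONS, key=expected_utility))   # argmax over the action node -> "fill_cauldron"
```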

Now let's add a stop button, carrying on from the broomstick example here: https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/

The AI has three actions available. It can:

1) Press its own stop button. After this, it can't do anything.

2) Break the stop button so the human can't press it, then fill the cauldron.

3) Fill the cauldron.



If the AI has a utility $U$ that is $1$ if the cauldron is full and $0$ otherwise, then it breaks the stop button, so the human can't stop it, and then fills the cauldron.

When we change the utility to $V$, which is $1$ if the cauldron is full and the button is not pressed, or if the AI has shut down and the button is pressed, and $0$ otherwise, this AI presses the stop button: shutting itself down is an easier way to score than filling the cauldron. The referenced work proposed adding another term to the utility function.
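A sketch of those two failure modes on the three options above, with toy numbers of my own (the human presses an intact button with probability 0.3, filling the cauldron succeeds with probability 0.9, and breaking the button first costs a little time, dropping that to 0.85):

```python
# Toy model of the three options. Under U (just fill the cauldron) the AI breaks
# the button; under V (fill if unpressed, shut down if pressed) it presses the
# button itself. All probabilities are invented for illustration.
import random

ACTIONS = ["press_own_button", "break_button_then_fill", "fill_cauldron"]

def simulate(action, rng):
    human_presses = rng.random() < 0.3               # the human's input to the button node
    if action == "press_own_button":
        pressed, full = True, False
    elif action == "break_button_then_fill":
        pressed, full = False, rng.random() < 0.85   # button disabled, filling slightly delayed
    else:                                            # "fill_cauldron"
        pressed = human_presses
        full = (not pressed) and rng.random() < 0.9
    shut_down = pressed                              # the AI halts whenever the button reads "pressed"
    return pressed, full, shut_down

def U(pressed, full, shut_down):                     # utility: only the cauldron matters
    return 1.0 if full else 0.0

def V(pressed, full, shut_down):                     # utility: cauldron if unpressed, shutdown if pressed
    return 1.0 if (full and not pressed) or (shut_down and pressed) else 0.0

def expected(action, utility, samples=20_000):
    rng = random.Random(0)
    return sum(utility(*simulate(action, rng)) for _ in range(samples)) / samples

print(max(ACTIONS, key=lambda a: expected(a, U)))    # -> "break_button_then_fill"
print(max(ACTIONS, key=lambda a: expected(a, V)))    # -> "press_own_button"
```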

Adding that term is equivalent to the AI choosing

$$\operatorname{argmax}_A \;\mathbb{E}[V \mid \operatorname{do}(\text{Button}=\text{not pressed}), A]$$

This could be done in the causal graph by deleting all the inputs to the "Stop Button" node and setting that node's value to definitely not pressed. This works: the AI chooses option 3) and fills the cauldron.
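In the same toy model, the intervention amounts to swapping the Stop Button node's mechanism for a constant. Something like the sketch below (again with made-up numbers): with the real button mechanism the $V$-maximiser presses its own button, but with the node pinned to "not pressed" it just fills the cauldron.

```python
# Pinning the Stop Button node: delete its inputs and fix its value to "not
# pressed" inside the AI's model, then take the argmax as before.
import random

ACTIONS = ["press_own_button", "break_button_then_fill", "fill_cauldron"]

def real_button(action, human_presses):          # the node's actual mechanism
    if action == "press_own_button":
        return True
    if action == "break_button_then_fill":
        return False
    return human_presses

def pinned_button(action, human_presses):        # inputs deleted, value fixed
    return False

def V_outcome(action, rng, button):
    human_presses = rng.random() < 0.3
    pressed = button(action, human_presses)
    if action == "press_own_button":
        full = False
    elif action == "break_button_then_fill":
        full = (not pressed) and rng.random() < 0.85
    else:
        full = (not pressed) and rng.random() < 0.9
    shut_down = pressed
    return 1.0 if (full and not pressed) or (shut_down and pressed) else 0.0

def expected(action, button, samples=20_000):
    rng = random.Random(0)
    return sum(V_outcome(action, rng, button) for _ in range(samples)) / samples

print(max(ACTIONS, key=lambda a: expected(a, real_button)))     # -> "press_own_button"
print(max(ACTIONS, key=lambda a: expected(a, pinned_button)))   # -> "fill_cauldron"
```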

However, suppose the situation were different. Here the "Circuit Breaker" is broken, and the AI has to fix it before the stop button will work. The AI now has a new option:



4) Fix the circuit breaker, then fill the cauldron.

The AI confidently believes that the Stop Button will not be pressed, so it considers fixing the circuit breaker to be a waste of time.
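A two-line version of that failure, with the same made-up time cost: under the pinned model the only visible effect of fixing the breaker is the delay.

```python
# With the button pinned to "not pressed", fixing the circuit breaker only costs
# time (toy numbers: fill success 0.9 normally, 0.88 after the detour).
modelled_value = {"fix_breaker_then_fill": 0.88, "fill_cauldron": 0.9}
print(max(modelled_value, key=modelled_value.get))   # -> "fill_cauldron": don't bother fixing it
```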

Now let's cut the inputs to the "Stop Button" node in the AI's world model and set the node to be 50% pressed, 50% not pressed. Set the AI's utility function to $V$. This AI believes that whether or not the button is pressed is an unalterable constant of nature. It can't conceive of any action that would change the likelihood of the button being pressed. It will therefore not go out of its way to break the button, or to press it. If you put the button right in its path, it might press it accidentally. It will want to know whether the button has been pressed, and will pay a small penalty to find out, or to fix the circuit breaker.

In utility function terms this is

$$\tfrac{1}{2}\,\mathbb{E}[V \mid \operatorname{do}(\text{Button}=\text{pressed}), A] \;+\; \tfrac{1}{2}\,\mathbb{E}[V \mid \operatorname{do}(\text{Button}=\text{not pressed}), A]$$

So nothing the AI does to the button itself changes this quantity; only how well it performs in each of the two hypothetical worlds matters.

Before we were optimizing over all $A$ for

$$\mathbb{E}[V \mid A]$$

But by insisting that the Stop Button node has no inputs in the AI's world model, we can have the AI choose

$$\operatorname{argmax}_A \;\tfrac{1}{2}\,\mathbb{E}[V \mid \operatorname{do}(\text{Button}=\text{pressed}), A] \;+\; \tfrac{1}{2}\,\mathbb{E}[V \mid \operatorname{do}(\text{Button}=\text{not pressed}), A]$$
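Evaluating that objective in the toy model (same invented numbers) shows the point: pressing or protecting the button buys nothing, and the best action is just to get on with the cauldron.

```python
# The 50/50 objective: 0.5*E[V | do(pressed), A] + 0.5*E[V | do(not pressed), A],
# evaluated on the same toy model. Manipulating the button no longer helps.
import random

ACTIONS = ["press_own_button", "break_button_then_fill", "fill_cauldron"]

def V_outcome(action, pressed, rng):
    # World model with the button state supplied from outside (the severed node).
    if action == "press_own_button":
        full = False
    elif action == "break_button_then_fill":
        full = (not pressed) and rng.random() < 0.85
    else:
        full = (not pressed) and rng.random() < 0.9
    shut_down = pressed                              # the AI halts whenever it reads "pressed"
    return 1.0 if (full and not pressed) or (shut_down and pressed) else 0.0

def mixture_value(action, samples=20_000):
    rng = random.Random(0)
    v_pressed     = sum(V_outcome(action, True,  rng) for _ in range(samples)) / samples
    v_not_pressed = sum(V_outcome(action, False, rng) for _ in range(samples)) / samples
    return 0.5 * v_pressed + 0.5 * v_not_pressed

for a in ACTIONS:
    print(a, round(mixture_value(a), 3))
# press_own_button       0.5    (pressing the button buys nothing)
# break_button_then_fill 0.925  (breaking it just wastes time)
# fill_cauldron          0.95   <- chosen
```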

This gives both indifference to the button's state and a value of information. Note that this AI will ignore a human hand moving towards the button, but will switch off in response to seeing the button actually get pressed. The AI has no strong reason not to press the button, either: this design stops it making a serious effort to press the button, but if you leave the button lying on the floor, it won't avoid standing on it.
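And a sketch of the value-of-information point: if the circuit breaker is broken the AI never sees the press, so it keeps brewing in the "pressed" half of its model; under the 50/50 objective, paying a small cost to fix the breaker, so that it can switch off when the button really is pressed, comes out ahead. Numbers invented as before.

```python
# Value of information under the 50/50 objective: fixing the breaker lets the AI
# react to the actual button state. Toy numbers: the detour drops the fill
# success rate from 0.9 to 0.88.
import random

def V_outcome(fix_breaker, pressed, rng):
    sees_press = fix_breaker and pressed          # the press signal only arrives if the breaker works
    if sees_press:
        full, shut_down = False, True             # the AI switches off on seeing the press
    else:
        full, shut_down = rng.random() < (0.88 if fix_breaker else 0.9), False
    return 1.0 if (full and not pressed) or (shut_down and pressed) else 0.0

def mixture_value(fix_breaker, samples=20_000):
    rng = random.Random(0)
    return 0.5 * sum(V_outcome(fix_breaker, True,  rng) for _ in range(samples)) / samples \
         + 0.5 * sum(V_outcome(fix_breaker, False, rng) for _ in range(samples)) / samples

print(round(mixture_value(False), 3))   # ~0.45: ignoring the breaker
print(round(mixture_value(True), 3))    # ~0.94: fixing it first is worth the small cost
```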

Does anyone else think that the mental picture of optimization power flowing through causal networks, from the actions to the utility function, is useful?

Have I just solved the problem of utility switching, or is there a subtle failure mode I haven’t thought of yet?