The non-indifferent behaviour of stratified indifference?

A putative new idea for AI control; index here.

This post aims to show some of the odd behaviour of stratified indifference. It seems that stratified indifference does not accomplish what I intended it to do, at least in certain situations.

Assume there are two binary events, X and Y, each of which happens with probability 1/2. The AI has no control over either event, but has some control over the correlation between them.

If the AI does nothing (takes the default action ∅), then X = Y: the two events are perfectly correlated. If it takes action a instead, then X = ¬Y: the two events are perfectly anti-correlated.

There are two utilities, u and v. If X happens, the humans will choose u as the utility to maximise; if ¬X does, then the humans will choose v. If Y happens, then u = 0 and v = 1; if ¬Y does, then u = 1 and v = 0.
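Putting the two rules together, the value of the utility the humans end up maximising is: in X ∧ Y, u is chosen and u = 0; in X ∧ ¬Y, u is chosen and u = 1; in ¬X ∧ Y, v is chosen and v = 1; in ¬X ∧ ¬Y, v is chosen and v = 0.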

Given the default action ∅, there are only two possible outcomes: X ∧ Y, in which u is chosen (u = 0), and ¬X ∧ ¬Y, in which v is chosen (v = 0). The value of ∅ is therefore 0.
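Explicitly, the expected value of the chosen utility under ∅ is (1/2)·0 + (1/2)·0 = 0.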

Now consider action a. The definition above does not define where the probability from X ∧ Y (or from ¬X ∧ ¬Y) flows to. So assume that X (or ¬X) happens first, and that a versus ∅ simply changes the value of the subsequent Y (or ¬Y).

Then X ∧ Y flows to X ∧ ¬Y, while ¬X ∧ ¬Y flows to ¬X ∧ Y. Consequently, the action a does not change the utility that the humans will choose, but it interchanges the values of the two utilities.

Thus E(U | a) = 1 > 0 = E(U | ∅), where U is the utility the humans end up choosing. Hence the action a is clearly superior to the default action.
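For concreteness, here is a minimal Python sketch of the toy model (the function names and the True/False encoding of X and Y are purely illustrative); it reproduces the two expected values:

```python
def utilities(Y: bool) -> tuple[float, float]:
    """Values of (u, v): if Y then u = 0, v = 1; if not Y then u = 1, v = 0."""
    return (0.0, 1.0) if Y else (1.0, 0.0)

def chosen_value(X: bool, Y: bool) -> float:
    """The humans choose u if X happens, v if not; return the chosen utility's value."""
    u, v = utilities(Y)
    return u if X else v

def expected_value(anti_correlated: bool) -> float:
    """Expected value of the chosen utility.
    Default action ∅: X = Y; action a: X = ¬Y."""
    total = 0.0
    for X in (True, False):                      # each branch has probability 1/2
        Y = (not X) if anti_correlated else X
        total += 0.5 * chosen_value(X, Y)
    return total

print(expected_value(anti_correlated=False))  # default action ∅: 0.0
print(expected_value(anti_correlated=True))   # action a: 1.0
```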

Now, this action does seem sensible. It's equivalent to waiting until the human choice is clear (even if that choice hasn't been 'officially' made yet), and then maximising that utility. It just doesn't quite seem to fit within the indifference formalism.
