The non-indifferent behaviour of stratified indifference?

A putative new idea for AI control; index here.

This post aims to show some of the odd behaviour of stratified indifference. It seems that stratified indifference does not accomplish what I intended it to do, at least in certain situations.

Assume there are two binary events, X and Y, each of which happens with probability 1/2. The AI has no control over either event, but has some control over the correlation between them.

If the AI does nothing (takes the default action ∅), then X = Y: the two events are perfectly correlated. If it takes action a instead, then X = ¬Y: the two events are perfectly anti-correlated.

There are two utilities, u and v. If X happens, the humans will choose u as the utility to maximise; if ¬X does, then the humans will choose v. If Y happens, then u = 0 and v = 1; if ¬Y does, then u = 1 and v = 0.
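Putting the two rules together, the value of the utility the humans end up maximising is: in X ∧ Y, u is chosen and u = 0; in X ∧ ¬Y, u is chosen and u = 1; in ¬X ∧ Y, v is chosen and v = 1; in ¬X ∧ ¬Y, v is chosen and v = 0.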

Given the default action ∅, there are only two possible outcomes: X ∧ Y, in which u is chosen (u = 0), and ¬X ∧ ¬Y, in which v is chosen (v = 0). The value of ∅ is therefore 0.
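Explicitly, the expected value of the chosen utility under ∅ is (1/2)·0 + (1/2)·0 = 0.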

Now consider action a. The definition above does not define where the probability from X ∧ Y (or from ¬X ∧ ¬Y) flows to. So assume that X (or ¬X) happens first, and that a versus ∅ simply changes the value of the subsequent Y (or ¬Y).

Then X ∧ Y flows to X ∧ ¬Y, while ¬X ∧ ¬Y flows to ¬X ∧ Y. Consequently, the action a does not change the utility that the humans will choose, but it interchanges the values of the two utilities.

Thus E(U | a) = 1 > 0 = E(U | ∅), where U is the utility the humans end up choosing. Hence the action a is clearly superior to the default action.
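For concreteness, here is a minimal Python sketch of the toy model (the function names and the True/False encoding of X and Y are purely illustrative); it reproduces the two expected values:

```python
def utilities(Y: bool) -> tuple[float, float]:
    """Values of (u, v): if Y then u = 0, v = 1; if not Y then u = 1, v = 0."""
    return (0.0, 1.0) if Y else (1.0, 0.0)

def chosen_value(X: bool, Y: bool) -> float:
    """The humans choose u if X happens, v if not; return the chosen utility's value."""
    u, v = utilities(Y)
    return u if X else v

def expected_value(anti_correlated: bool) -> float:
    """Expected value of the chosen utility.
    Default action ∅: X = Y; action a: X = ¬Y."""
    total = 0.0
    for X in (True, False):                      # each branch has probability 1/2
        Y = (not X) if anti_correlated else X
        total += 0.5 * chosen_value(X, Y)
    return total

print(expected_value(anti_correlated=False))  # default action ∅: 0.0
print(expected_value(anti_correlated=True))   # action a: 1.0
```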

Now, this action does seem sensible. It's equivalent to waiting until the human choice is clear (even if that choice hasn't been 'officially' made yet), and then maximising that utility. It just doesn't quite seem to fit within the indifference formalism.
