Stuart_Armstrong comments on Generalizing the Corrigibility paper’s impossibility result?

Stuart_Armstrong 6 Feb 2015 17:20 UTC
0 points
I think we can improve the setup by conditioning only on things the AI has no control over.

Imagine that every turn, there is a random bit B sent along a wire. It’s almost certainly 1, but there’s a tiny chance that it’s 0.

If the button is pressed and B=0, then the agent’s utility doesn’t update. But if the button is pressed and B=1, the utility is updated as usual.
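To make the rule concrete, here is a minimal sketch of the update, assuming a per-turn bit that is 0 with some small probability. The names `draw_bit`, `active_utility`, `eps`, `u_normal` and `u_shutdown` are mine, not from the paper, and the pseudo-random generator is only a stand-in for a source the AI cannot influence.

```python
import random


def draw_bit(rng, eps=1e-6):
    """Return B: 0 with (assumed) small probability eps, otherwise 1."""
    return 0 if rng.random() < eps else 1


def active_utility(pressed, b, u_normal, u_shutdown):
    """Return the utility the agent holds after this turn.

    The switch to u_shutdown happens only when the button is pressed
    and B=1. If B=0, a press leaves the utility unchanged.
    """
    if pressed and b == 1:
        return u_shutdown
    return u_normal


# Hypothetical usage: one turn in which the button is pressed.
rng = random.Random(0)
b = draw_bit(rng)
utility = active_utility(True, b, "U_N", "U_S")
```

The point of the sketch is just that the press only matters in the B=1 branch, so the B=0 branch gives a world where the button is pressed but the utility stays as it was.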

Except the expression is conditioned not on “press”, but on “B=0”.

Now we just have to define B in such a way that the AI can’t affect it—we need a defined source of true randomness.

Even better: B has already been calculated by some defined past process; the AI just doesn’t know what it is yet.