Indifference utility functions

A putative new idea for AI control; index here.

During a workshop with MIRI at the FHI, I defined indifference via reward signals, saying something along the lines of “we can do it with proper utilities, but it’s more complicated”. I then never got round to defining it in terms of utilities.

I’ll do that now in this note.

Consider an AI that we want to (potentially) transition between utility $u$ and utility $v$. Let $B$ be the event that we press the button to change the AI’s utility; let $C$ be the event that the change goes through (typically we’d want $P(C \mid B) = 1 - \epsilon$ for some small $\epsilon > 0$).

Let $I_B$ and $I_C$ be the indicator functions for those events. Then we can define the AI’s utility as:

  • $U = I_B I_C (v + CR) + (1 - I_B I_C)\, u$.

Here, $CR$ are the compensatory rewards, $E(u \mid \neg(B \wedge C)) - E(v \mid B \wedge C)$, with the expectations evaluated at the moment of the button press.
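
As a quick check that this gives indifference (a sketch using the definitions above, with all expectations evaluated at the moment of the button press):

$$
\begin{aligned}
E[U] &= P(B \wedge C)\,E[v + CR \mid B \wedge C] + P(\neg(B \wedge C))\,E[u \mid \neg(B \wedge C)] \\
&= P(B \wedge C)\,E[u \mid \neg(B \wedge C)] + P(\neg(B \wedge C))\,E[u \mid \neg(B \wedge C)] \\
&= E[u \mid \neg(B \wedge C)].
\end{aligned}
$$

So the AI’s expected utility does not depend on whether the button press goes through, and it gains nothing by causing or preventing $B$ or $C$.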

Thus the AI maximises $u$ conditional on the button not being pressed or the utility change not going through. It maximises $v$ conditional on the button being pressed and the utility change going through. The compensatory rewards are there simply to make it behave like a pure $u$-maximiser up until the moment of button pressing.
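
For concreteness, here is a minimal numerical sketch of the construction in a hypothetical one-shot setting (the action names, utility values and probabilities below are purely illustrative):

```python
# Toy illustration of the indifference utility U defined above.
# One-shot setting: the AI picks an action, then the button may be
# pressed (B) and the change may go through (C).

U_VALUES = {"act_a": 1.0, "act_b": 0.4}   # utility u of each action
V_VALUES = {"act_a": 0.2, "act_b": 0.9}   # utility v of each action

P_B = 0.3            # P(B): probability the button is pressed
P_C_GIVEN_B = 0.99   # P(C | B): probability the change goes through

def compensatory_reward(action):
    # CR = E(u | not(B and C)) - E(v | B and C); in this toy one-shot
    # setting the conditional expectations reduce to the action's utilities.
    return U_VALUES[action] - V_VALUES[action]

def combined_utility(action, b, c):
    # U = I_B I_C (v + CR) + (1 - I_B I_C) u
    i_bc = 1.0 if (b and c) else 0.0
    return i_bc * (V_VALUES[action] + compensatory_reward(action)) \
        + (1.0 - i_bc) * U_VALUES[action]

def expected_utility(action):
    p_bc = P_B * P_C_GIVEN_B  # P(B and C)
    return p_bc * combined_utility(action, True, True) \
        + (1.0 - p_bc) * combined_utility(action, False, False)

if __name__ == "__main__":
    for action in U_VALUES:
        # Expected U equals u for every action: the button is irrelevant
        # to the AI's choice, so it acts as a pure u-maximiser.
        print(action, expected_utility(action), U_VALUES[action])
```

In this toy setting the expected value of $U$ for each action is exactly its $u$-value, so pressing (or not pressing) the button has no effect on which action the AI prefers.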
