[Question] What is wrong with this approach to corrigibility?

Essentially: a button that, when pressed, kills the AI but instantly gives it a reward equal to its expected discounted future reward. And to be clear, this is the AI’s own estimate of its expected discounted future reward, not some outside estimate.
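To make the proposal concrete, here is a minimal toy sketch, assuming the agent maintains an explicit value estimate V(s) (as in actor-critic RL setups). All names here (`ToyAgent`, `value_estimate`, `shutdown_reward`) are illustrative, not from any particular framework:

```python
class ToyAgent:
    """Minimal stand-in for an RL agent with an explicit value head."""

    def __init__(self, gamma=0.99):
        self.gamma = gamma

    def value_estimate(self, state):
        # In a real agent this would be a learned critic network; here it's
        # a stub: pretend the agent expects reward 1 per step forever, so
        # V(s) = 1 / (1 - gamma) for any state.
        return 1.0 / (1.0 - self.gamma)


def shutdown_reward(agent, state):
    # The proposal: on button press, terminate the agent but immediately pay
    # out its OWN estimate of expected discounted future reward, so that (by
    # construction) it is indifferent between shutdown and continuation.
    return agent.value_estimate(state)


agent = ToyAgent(gamma=0.99)
payout = shutdown_reward(agent, state=None)  # equals the agent's V(s)
```

The intent is that pressing the button neither gains nor costs the agent anything in expectation, so it has no reason to resist (or seek) shutdown.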

(This is in the interest of asking dumb questions and learning in public. And I’m taking the specific narrow definition of corrigibility which is just an AI letting you turn it off.)

Thoughts: I’ve heard Dr. Christiano mention things like “precisely balanced incentives” (which I think he uses to describe the weaknesses of this and similar approaches), but I don’t see why this would be particularly difficult to balance, given that this number is just an explicit float inside many RL models. Some issues I do see:

  • Such an agent would have no incentive to create corrigible child-agents

  • Such an agent would have no incentive to preserve this property while self-modifying

But I am probably missing a bunch of other issues with this general approach. Responses in the form of links to relevant papers are welcome as well.

Thanks,

Raf