Indifference and compensatory rewards

A putative new idea for AI control; index here.

It’s occurred to me that all “indifference” results can be seen within a single framework, as compensatory rewards. This covers both indifference to a change of utility function and indifference to a change of policy.

Imagine that the agent has reward $R_{0}$ and is following policy $π_{0}$, and we want to change it to having reward $R_{1}$ and following policy $π_{1}$.

Then the compensatory reward we need to pay it, so that it doesn’t attempt to resist or cause that change, is simply the difference between the two expected values:

$V(R_{0} | π_{0}) - V(R_{1} | π_{1})$,

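where $V$ is the agent’s own valuation of the expected reward, conditional on the policy. With this payment added, the agent’s expected value after the change equals $V(R_{0} | π_{0})$, its expected value before the change, so it gains nothing by either resisting or hastening the change.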
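As a minimal sketch of the formula above, assume a one-step setting where a policy is a probability distribution over actions and a reward is a table from actions to numbers. That setting, the function names, and all the numbers below are hypothetical and chosen only for illustration; they are not part of the original proposal.

```python
def expected_value(reward, policy):
    """V(R | pi) in a one-step setting (an illustrative assumption).

    `reward` maps action -> reward, `policy` maps action -> probability.
    """
    return sum(prob * reward[action] for action, prob in policy.items())


def compensatory_reward(r0, pi0, r1, pi1):
    """V(R0 | pi0) - V(R1 | pi1): the payment that leaves the agent indifferent."""
    return expected_value(r0, pi0) - expected_value(r1, pi1)


# Hypothetical rewards and policies before (0) and after (1) the change.
r0 = {"left": 1.0, "right": 3.0}
pi0 = {"left": 0.0, "right": 1.0}
r1 = {"left": 2.0, "right": 0.0}
pi1 = {"left": 1.0, "right": 0.0}

c = compensatory_reward(r0, pi0, r1, pi1)

# After the change the agent expects V(R1 | pi1) + c, which matches V(R0 | pi0).
total_after = expected_value(r1, pi1) + c
print(c, total_after, expected_value(r0, pi0))
```

The point of the check at the end is that the agent's expected total is the same whether or not the change happens, which is what indifference means here.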

This explains why off-policy reward-based agents are already safely interruptible: an interruption changes the policy, not the reward, so $R_{0} = R_{1}$. And since an off-policy agent’s value estimates do not depend on the policy followed, $V(R_{0} | π_{0}) = V(R_{1} | π_{1})$, and the compensatory reward is zero.
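The zero-compensation case can be sketched on a toy problem. Everything below is an assumption made for illustration: a hypothetical two-step deterministic MDP, no discounting, and exact values computed by recursion rather than learned estimates. The on-policy evaluation is included only as a contrast and is not part of the argument above.

```python
# Hypothetical MDP: (state, action) -> (next_state, reward); None ends the episode.
TRANSITIONS = {
    (0, "go"): (1, 0.0),
    (0, "stop"): (None, 1.0),
    (1, "go"): (None, 10.0),
    (1, "stop"): (None, 1.0),
}
ACTIONS = ("go", "stop")


def off_policy_value(state):
    """Q-learning-style value: backs up the max over actions.

    Takes no policy argument, because the estimate ignores the behaviour policy.
    """
    if state is None:
        return 0.0
    return max(off_policy_q(state, action) for action in ACTIONS)


def off_policy_q(state, action):
    next_state, reward = TRANSITIONS[(state, action)]
    return reward + off_policy_value(next_state)


def on_policy_value(state, policy):
    """On-policy value: backs up the action the policy actually takes."""
    if state is None:
        return 0.0
    next_state, reward = TRANSITIONS[(state, policy[state])]
    return reward + on_policy_value(next_state, policy)


pi0 = {0: "go", 1: "go"}    # uninterrupted policy
pi1 = {0: "go", 1: "stop"}  # interrupted in state 1; the reward is unchanged

# Same reward and a policy-independent estimate, so the compensation is zero.
off_policy_compensation = off_policy_value(0) - off_policy_value(0)

# An on-policy valuation does depend on the policy, so the difference is non-zero.
on_policy_compensation = on_policy_value(0, pi0) - on_policy_value(0, pi1)

print(off_policy_compensation, on_policy_compensation)
```

In this toy case the off-policy agent needs no payment to accept the interruption, while an agent that valued states under its actual policy would need to be compensated for the reward it no longer expects to collect.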