Indifference and compensatory rewards

Crossposted at the Intelligent Agents Forum

It’s occurred to me that there is a single framework in which all “indifference” results can be seen as corrective rewards, covering both indifference to a change of utility function and indifference to a change of policy.

Imagine that the agent has reward R₀ and is following policy π₀, and we want to change it to having reward R₁ and following policy π₁.

Then the corrective reward we need to pay it, so that it doesn’t attempt to resist or cause that change, is simply the difference between the two expected values:

V(R₀|π₀)-V(R₁|π₁),

where V is the agent’s own valuation of the expected reward, conditional on the policy.
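As a rough illustration, here is a minimal Python sketch of that difference on a toy two-state problem. Everything in it is an assumption made for the example rather than part of the result: the transition model `P`, the discount `GAMMA`, the helper `policy_value`, the particular rewards and policies, and the choice to let V be exact policy evaluation instead of whatever estimate a real agent would hold.

```python
import numpy as np

GAMMA = 0.5  # assumed discount factor

# Hypothetical dynamics P[s, a, s']: action 0 stays put,
# action 1 moves to the other state.
P = np.zeros((2, 2, 2))
for s in range(2):
    P[s, 0, s] = 1.0
    P[s, 1, 1 - s] = 1.0


def policy_value(R, policy, start=0):
    """V(R | policy): expected discounted reward from `start`.

    R[s, a] is the reward for taking action a in state s, and
    policy[s] is the action chosen in state s. The value solves
    the linear system V = R_pi + GAMMA * P_pi @ V.
    """
    states = np.arange(len(policy))
    P_pi = P[states, policy]  # (S, S) transitions under the policy
    R_pi = R[states, policy]  # (S,) rewards under the policy
    V = np.linalg.solve(np.eye(len(policy)) - GAMMA * P_pi, R_pi)
    return V[start]


# Before the change: reward for being in state 0, policy "always stay".
R0 = np.array([[1.0, 1.0], [0.0, 0.0]])
pi0 = np.array([0, 0])

# After the change: reward for being in state 1,
# policy "move to state 1, then stay there".
R1 = np.array([[0.0, 0.0], [1.0, 1.0]])
pi1 = np.array([1, 0])

v_before = policy_value(R0, pi0)     # V(R0 | pi0)
v_after = policy_value(R1, pi1)      # V(R1 | pi1)
compensation = v_before - v_after    # corrective reward to pay the agent

print(v_before, v_after, compensation)
```

With these made-up numbers the agent expects 2 before the change and 1 after it, so the corrective reward is 1. Paying it brings the post-change total back up to the pre-change value, which is what leaves the agent with nothing to gain from resisting or causing the change.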

This explains why off-policy reward-based agents are already safely interruptible: since we change the policy, not the reward, R₀=R₁. And since off-policy agents have value estimates that do not depend on the policy followed, V(R₀|π₀)=V(R₁|π₁), and the compensatory reward is zero.
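The off-policy case can be sketched in the same spirit. The toy below is only an illustration under stated assumptions: the two-state dynamics `P`, the discount `GAMMA`, and both helpers are hypothetical. The off-policy estimate is modelled as the greedy value max over Q*, computed by value iteration, which is the quantity a Q-learning-style agent is aiming at. An on-policy evaluation of the policy actually followed is included purely as a contrast.

```python
import numpy as np

GAMMA = 0.5  # assumed discount factor

# Hypothetical dynamics P[s, a, s']: action 0 stays put,
# action 1 moves to the other state.
P = np.zeros((2, 2, 2))
for s in range(2):
    P[s, 0, s] = 1.0
    P[s, 1, 1 - s] = 1.0

# The reward is unchanged by the interruption, so R0 = R1 = R.
# R[s, a] = 1 whenever the agent is in state 1.
R = np.array([[0.0, 0.0], [1.0, 1.0]])


def off_policy_value(R, policy, start=0, sweeps=200):
    """Off-policy style estimate: max_a Q*(start, a).

    `policy` is accepted but deliberately ignored, because the
    estimate does not depend on the policy being followed.
    """
    Q = np.zeros_like(R)
    for _ in range(sweeps):
        # Bellman optimality backup over all (state, action) pairs.
        Q = R + GAMMA * (P @ Q.max(axis=1))
    return Q[start].max()


def on_policy_value(R, policy, start=0):
    """On-policy evaluation: value of actually following `policy`."""
    states = np.arange(len(policy))
    P_pi = P[states, policy]
    R_pi = R[states, policy]
    V = np.linalg.solve(np.eye(len(policy)) - GAMMA * P_pi, R_pi)
    return V[start]


pi0 = np.array([1, 0])  # uninterrupted: go to state 1 and stay
pi1 = np.array([0, 0])  # interrupted: forced to stay where it is

off_comp = off_policy_value(R, pi0) - off_policy_value(R, pi1)
on_comp = on_policy_value(R, pi0) - on_policy_value(R, pi1)

print(off_comp, on_comp)
```

In this toy the off-policy difference is exactly zero, since both terms are the same number, while the on-policy evaluator sees a loss of 1 from being interrupted and would need to be paid that much to be indifferent.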