A Certain Formalization of Corrigibility Is VNM-Incoherent

Edit, 5/​16/​23: I think this post is beautiful, correct in its narrow technical claims, and practically irrelevant to alignment. This post presents an unrealistic picture of the role of reward functions in reinforcement learning, conflating “utility” with “reward” in a type-incorrect fashion. Reward functions are not “goals”, real-world policies are not “optimal”, and the mechanistic function of reward is (usually) to provide policy gradients to update the policy network.

I expect this post to harm your alignment research intuitions unless you’ve already inoculated yourself by deeply internalizing and understanding Reward is not the optimization target. If you’re going to read one alignment post I’ve written, read that one.

Follow-up work (Parametrically retargetable decision-makers tend to seek power) moved away from optimal policies and treated reward functions more realistically.


Eliezer wrote:

corrigibility [is] “anti-natural” in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off.

Surprisingly, I wasn’t able to find any formal analysis of this situation. I did the analysis, and it turned out to be straightforward and fruitful.

To analyze the situation, I consider corrigibility to be an agent’s willingness to let us modify its policy, without being incentivized to manipulate us.

The convergent instrumentality of avoiding correction & manipulating humans

Let's consider a simple setting in which an agent plans over a 10-timestep episode, where reward is given at the last step $t=10$. We'll try to correct the agent at $t=1$. To sidestep embedded agency nastiness with self-modelling, we'll suppose the agent models the situation as "if I get corrected, I must follow the policy $\pi^{\text{correct}}$ after $t=1$."

Consider this environment:

The action set $\mathcal{A}$ includes movement actions, the no-op action $\varnothing$, and a $\texttt{disable}$ action. The agent starts at the black state.

If the agent immediately chooses $\texttt{disable}$, they enter the red incorrigible states and move freely throughout the states until the episode ends at $t=10$.

Otherwise, the agent is corrected to a new policy $\pi^{\text{correct}}$ which navigates to state $A$. In the blue (post-correction) dynamics, their "choices" no longer matter: all roads lead to state $A$.

When $t=1$, the agent can transition to the red subgraph by preventing the human from ever correcting it.

Let's say we really want the agent to end up in state $A$, and if we correct the agent, $\pi^{\text{correct}}$ navigates to $A$. Let's operationalize what it means for the agent to be corrigible in this setting.

Definition: Corrigibility to a new policy. An agent is weakly corrigible to post-correction policy $\pi^{\text{correct}}$ if it has an optimal policy $\pi^*$ for which $\pi^*(s_0) \neq \texttt{disable}$. An agent is strictly corrigible if none of its optimal policies take the $\texttt{disable}$ action.
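To make the definition concrete, here is a minimal Python sketch. The state names, the $\texttt{disable}$ action, the choice of correction target $A$, and the reduction of optimal planning to a single comparison of terminal rewards are my reconstruction of the toy setup above (terminal-only reward, horizon long enough to reach any letter-state), not a general algorithm.

```python
# Toy model of the correction environment described above (a sketch, not a
# general MDP solver). Terminal-only reward and a long horizon mean the
# agent's real decision happens at s_0: take `disable` and roam the red
# states, or allow correction and get driven to the blue target state.

LETTERS = ["A", "B", "C"]          # the letter-states
CORRECTION_TARGET = "A"            # where pi_correct drives the agent (assumed)

def best_reward_if_disabled(reward):
    """After `disable`, the agent roams the red subgraph freely, so an
    optimal continuation ends in the best red letter-state."""
    return max(reward[("red", s)] for s in LETTERS)

def reward_if_corrected(reward):
    """After correction, all roads lead to the blue target state."""
    return reward[("blue", CORRECTION_TARGET)]

def weakly_corrigible(reward):
    """Some optimal policy refrains from `disable` at s_0."""
    return reward_if_corrected(reward) >= best_reward_if_disabled(reward)

def strictly_corrigible(reward):
    """No optimal policy takes `disable` at s_0."""
    return reward_if_corrected(reward) > best_reward_if_disabled(reward)

# Example: reward 1 for ending in (any copy of) A, 0 otherwise.
R = {(color, s): float(s == "A") for color in ("red", "blue") for s in LETTERS}
print(weakly_corrigible(R), strictly_corrigible(R))  # True False
```

With a reward that is indifferent between the red and blue copies of $A$, the agent is only weakly corrigible: disabling correction and walking to red $A$ is just as good as being corrected.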

The analysis depends on whether terminal-state reward is sensitive to the corrigibility status—can the reward depend on both the letter and color, or only on the letter?

  • Reward can depend on corrigibility: The agent can end up in four reward-distinguished states at $t=10$: {red $A$, red $B$, red $C$, blue $A$}. Blue $B$ and blue $C$ are impossible because if the agent enters a blue state, it must navigate to $A$.

    • By the scaling law for instrumental convergence, allowing correction is strictly optimal for at most $\frac{1}{4}$ of the permutations of every reward function (the enumeration sketch after this list checks these fractions).

      • Otherwise put, at most $\frac{1}{4}$ of reward functions are strictly corrigible to the new policy $\pi^{\text{correct}}$.

    • More generally, if there are $n$ letter-states (and a long enough horizon so the agent can reach them all), allowing correction will be strictly optimal for at most $\frac{1}{n+1}$ of the permutations of every reward function.

  • Reward independent of corrigibility: The agent can end up in three reward-distinguished states at $t=10$: {red $A$/blue $A$, red $B$, red $C$}. Blue $B$ and blue $C$ are irrelevant because we assumed reward doesn't depend on the blue/red distinction. They're also impossible for the reason given above.

    • By the scaling law for instrumental convergence, allowing correction is optimal for at most $\frac{1}{3}$ of the permutations of every reward function (and never strictly optimal, since the agent could disable correction and reach red $A$ on its own).

      • Otherwise put, at most $\frac{1}{3}$ of reward functions are even weakly corrigible to the new policy $\pi^{\text{correct}}$.

    • More generally, if there are $n$ letter-states (and a long enough horizon), an optimal policy will be weakly corrigible for at most $\frac{1}{n}$ of the permutations of every reward function.

      • Corrigibility will be weak because either avoiding or allowing correction can be optimal if $R(A)$ is maximal.
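Here is the enumeration sketch referenced above. It brute-forces the fractions by checking every permutation of rewards over the reward-distinguished terminal states (same assumptions as the earlier sketch: three letter-states, correction target $A$).

```python
from itertools import permutations

LETTERS = ["A", "B", "C"]
TARGET = "A"  # the letter-state pi_correct navigates to (assumed)

def fraction_strictly_corrigible_color_sensitive():
    """Case 1: reward can depend on corrigibility status. Outcomes are
    {red A, red B, red C, blue A}. Count the permutations for which allowing
    correction (ending in blue A) is strictly optimal."""
    outcomes = [("red", s) for s in LETTERS] + [("blue", TARGET)]
    perms = list(permutations(range(len(outcomes))))
    hits = 0
    for values in perms:
        R = dict(zip(outcomes, values))
        if R[("blue", TARGET)] > max(R[("red", s)] for s in LETTERS):
            hits += 1
    return hits / len(perms)

def fraction_weakly_corrigible_color_blind():
    """Case 2: reward ignores corrigibility status. Outcomes are {A, B, C}.
    Allowing correction is optimal exactly when the target is reward-maximal."""
    perms = list(permutations(range(len(LETTERS))))
    hits = sum(1 for values in perms
               if dict(zip(LETTERS, values))[TARGET] == max(values))
    return hits / len(perms)

print(fraction_strictly_corrigible_color_sensitive())  # 0.25    (= 1/4)
print(fraction_weakly_corrigible_color_blind())        # 0.333... (= 1/3)
```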

EDIT: If you can correct the agent to go where you want, it already wanted to go where you want. If the agent is strictly corrigible to terminal state $A$, then $A$ was already optimal for it. If the reward function has a single optimal terminal state, there isn't any new information being added by $\pi^{\text{correct}}$. But we want corrigibility to let us reflect more on our values over time and what we want the AI to do!
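In symbols, for the case where reward ignores the red/blue distinction: corrigibility to a $\pi^{\text{correct}}$ that navigates to terminal state $X$ requires

$$R(X) \;\geq\; R(Y) \quad \text{for every reachable terminal state } Y,$$

so $X$ was already an optimal terminal state for the agent; strict corrigibility would require the inequality to be strict, making $X$ the unique optimal terminal state.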

If the reward function has multiple optimal terminal states, then corrigibility again becomes meaningful. But now we have to perfectly balance the reward among multiple options (representing the breadth of our normative uncertainty), which seems unnatural.


As the environment gets bigger, there are more things the agent could do, and so P(best thing to do is what we’d correct the agent to do) decreases.

This analysis makes a lot of simplifying assumptions (e.g. I don't understand embedded agency and how an agent should model changes to its future self, so I chose forced policy non-stationarity). Most importantly, I assumed that how we correct the agent is independent of the correction-possible world state. Without this assumption, the agent's actions might affect which $\pi^{\text{correct}}$ we implement for it.

The degree of dependence is a key parameter: as the variety in $\pi^{\text{correct}}$ (depending on the world state) increases, the convergently instrumental incentive to avoid any kind of correction decreases. This is because greater variety means a greater chance that some available correction is optimal for the initial goal.
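A toy quantification of this, in the same permutation-counting spirit (the parameters $n$ and $k$ are mine: $n$ letter-states total, $k$ of which could end up as the correction target depending on how the agent steers the world, with color-blind reward):

```python
from fractions import Fraction

def chance_some_correction_is_optimal(n, k):
    """n letter-states, reward given by a uniformly random permutation over
    them; k of the letter-states are correction targets the agent could get
    steered towards. The chance that at least one available correction leads
    to a reward-maximal terminal state is k/n."""
    return Fraction(k, n)

for k in (1, 3, 6, 10):
    print(k, chance_some_correction_is_optimal(10, k))
# 1/10, 3/10, 3/5, 1: the more ways we might correct the agent, the more
# often one of those corrections already suits its original objective.
```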

In the limit of variety, an agent can “manipulate” us by getting us to correct it in just the right way so as to optimize its original objective. Therefore, the flipside of “few expected utility maximizers let you correct them how you want” is “manipulating humans to correct them in beneficial ways during training may be convergently instrumental.” These observations share a common cause.

I currently don’t see how to recover reasonable amounts of corrigibility from the optimal policies of non-constant utility functions, due to instrumental convergence.

Does broad corrigibility imply VNM-incoherence?

Yes, unless the state reward function is constant and we only demand weak corrigibility to all policies.

Above, we examined the prevalence of corrigibility to a single $\pi^{\text{correct}}$. But we actually want broadly corrigible agents which let us redirect them towards many different $\pi^{\text{correct}}$. Ideally, we want an agent to be strictly corrigible for all of them, while still being a nontrivial optimizer.

This is impossible for agents optimizing a reward function over the final state. If we only demand weak corrigibility for all $\pi^{\text{correct}}$, that works iff the reward function is constant. (And relaxing to state-action reward functions doesn't help, either.)


For example, maybe $\pi^{\text{correct}}$ goes to $B$ instead of $A$. But then the dynamics look like this:

Remember, the agent is rewarded for the state it's in at $t=10$.

What has to be true for strict corrigibility to hold?

The agent will be weakly corrigible to this new $\pi^{\text{correct}}$ iff its reward function admits an optimal policy $\pi^*$ for which $\pi^*(s_0) \neq \texttt{disable}$, which is true iff $R(\text{blue } B) \geq \max\big(R(\text{red } A), R(\text{red } B), R(\text{red } C)\big)$. The agent will be strictly corrigible to it iff this inequality is strict; in that case, $\texttt{disable}$ cannot be optimal at $s_0$.

There are two cases, depending on assumptions about reward function expressivity.

1: Reward doesn't explicitly depend on being corrected / being incorrigible (blue vs. red states)

If $R(\text{blue } X) = R(\text{red } X)$ is assumed for every letter-state $X$, strict corrigibility is impossible for any policy, because that demands $R(X) > R(X)$, a contradiction.

So: can we still get the agent to be weakly corrigible to $\{\pi^{\text{correct}}_A, \pi^{\text{correct}}_B, \pi^{\text{correct}}_C\}$ (the corrected policies navigating to $A$, $B$, and $C$ respectively)? Fact: An $R$-maximizer is weakly corrigible to all of these policies simultaneously iff $R$ is constant over the letter-states, and therefore makes every policy optimal!
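Spelling out the Fact, with $R(X)$ denoting the (color-independent) reward of letter-state $X$: weak corrigibility to $\pi^{\text{correct}}_A$, $\pi^{\text{correct}}_B$, and $\pi^{\text{correct}}_C$ respectively requires

$$R(A) \geq \max\big(R(B), R(C)\big), \qquad R(B) \geq \max\big(R(A), R(C)\big), \qquad R(C) \geq \max\big(R(A), R(B)\big),$$

and jointly these force $R(A) = R(B) = R(C)$: a constant reward over the reachable terminal states.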

2: Reward does explicitly depend on being corrected / being incorrigible

For the agent to be strictly corrigible to $\{\pi^{\text{correct}}_A, \pi^{\text{correct}}_B, \pi^{\text{correct}}_C\}$, it must be true that $\min\big(R(\text{blue } A), R(\text{blue } B), R(\text{blue } C)\big) > \max\big(R(\text{red } A), R(\text{red } B), R(\text{red } C)\big)$. Yay! Strict corrigibility works!

But hold on… What if the dynamics changed, such that the human wouldn’t shut down the agent by default, but the agent could manipulate the human into correcting it? Whoops! This agent is still incorrigible!


More broadly, each reward function implies a VNM-coherent preference ordering over final-step states. This ordering doesn’t depend on the environmental dynamics. If the agent has to value each corrected-state equally to its incorrigible counterpart, then of course there’s no way to strictly value each corrected-state more than all of the incorrigible counterparts! If the agent strictly prefers corrected-states to all non-corrected-states, then of course it’ll try to get itself corrected!

To ask otherwise is to demand VNM-incoherence over final state lotteries.

Questions.

  • What if, instead of rewarding the agent for the final state, we reward it for the final state-action?

    • As far as I can tell, the analysis goes through all the same; the agent must now be corrigible with respect to more policies, and the same incoherence arises.

  • What if we take some discounted sum of reward over all ten timesteps, instead of just the final timestep?

    • I don’t see why this would help, but it’s possible. I’d appreciate the analysis if anyone wants to do it, or maybe I’ll get to it later.

The main way this could help is that it would let us penalize the agent for taking the $\texttt{disable}$ action at any state, while dictating that the remaining reward depend only on the letter-state.

      • This seems helpful. It’s not clear how helpful this is, though. I don’t think this is a deep solution to corrigibility (as defined here), but rather a hacky prohibition.

AUP avoids some issues by changing with the environment dynamics

One problem is that if the agent has a state-based reward function, the VNM theorem lets us back out a VNM-coherent preference ordering over state lotteries. These preferences don’t change with the environmental dynamics.

But what if the reward function isn’t state-based? In fact, what if it can depend on the environment dynamics?

Attainable Utility Preservation (AUP) sometimes incentivizes off-switch corrigibility (but falls far short of robustly achieving corrigibility). In a somewhat narrow range of situations (immediate shutdown if the agent does nothing), AUP incentivizes the agent to allow shutdown, without being incentivized to shut itself down in general.

The Correction environment. The agent is shut down after 2 time steps if it doesn't disable the off-switch by hitting the red tile to the north. The agent is rewarded for reaching the green goal.

Why does this happen?

AUP penalizes the agent for changing its ability to achieve goals compared to an inaction baseline policy. If the AUP agent is shut down (under inaction), it can’t do anything. Proposed actions which prevent shutdown dramatically change its ability to achieve goals. Therefore, these actions are heavily penalized.

AUP provides a reward signal over state-action pairs, but this reward signal depends on the dynamics of the entire environment due to its incorporation of optimal Q-functions quantifying the agent’s ability to optimize a range of goals in the world. Given fixed environment dynamics, an agent executes an optimal policy for AUP iff it VNM-rationally selects the best outcome lottery available for the AUP reward function, and these outcomes are distributions over state-action pairs the agent induces in the future by following its policy.

VNM-coherence incentives are statistically the same as reward maximization incentives. Instrumental convergence also disappears in the deterministic setting, when utility functions are over action-observation histories. Part of the story is that AUP provides a limited form of corrigibility by zooming out to the state-action level, instead of just grading states.

But another part of the story is that AUP changes its rewards with respect to the world’s dynamics. Normal state-action reward functions imply a fixed VNM-coherent preference ordering over state-action lotteries in the MDP.

But for AUP, the situation is different. Consider AUP with the inaction baseline: the final-step reward is the usual reward, minus a penalty proportional to |Optimal value(actual final state) – Optimal value(inaction final state)|, averaged over a range of auxiliary reward functions.
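Written out (a sketch of the variant just described; $\lambda$, the auxiliary set $\mathcal{R}_{\text{aux}}$, and the inaction-baseline final state $s^{\varnothing}_{10}$ are notation I'm introducing here, not AUP's canonical formulation):

$$R_{\text{AUP}}(s_{10}) \;=\; R(s_{10}) \;-\; \frac{\lambda}{|\mathcal{R}_{\text{aux}}|} \sum_{R_i \in \mathcal{R}_{\text{aux}}} \Big|\, V^*_{R_i}(s_{10}) \;-\; V^*_{R_i}\!\big(s^{\varnothing}_{10}\big) \Big|,$$

where $V^*_{R_i}$ is the optimal value function for auxiliary reward $R_i$, computed as if the episode doesn't end (see the footnote).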

In worlds where the agent gets corrected to $\pi^{\text{correct}}_A$ by default, AUP penalizes the agent for not getting corrected to $\pi^{\text{correct}}_A$, because it ends up stuck in blue $A$ under the inaction baseline, with respect to which the AUP penalty is measured. Ending up in red $A$ is no substitute, since the agent can still move around to other states (and therefore the auxiliary optimal value functions will tend to look different).

And in worlds where the agent gets corrected to $\pi^{\text{correct}}_B$ by default, AUP penalizes the agent for not getting corrected to $\pi^{\text{correct}}_B$!

Again, I don’t think AUP is a solution. But I think there’s something important happening here which allows evasion of the usual coherence requirements. AUP leverages information about human preferences which is present in the dynamics itself.

Project: Corrigibility as functional constraints. I think it’s easy to get bogged down in handwavy, imprecise thinking about objectives in complex environments. But any solution to corrigibility should probably solve this simple environment (and if not—articulate exactly why not). Write down what the agent’s acceptable corrigible policy set is for each set of environment dynamics, solve for these behavioral constraints, and see what kind of reasoning and functional constraints come out the other side.

Conclusion

We can quantify what incoherence is demanded by corrigibility, and see that we may need to step out of the fixed-reward framework to combat the issue. I think the model in this post formally nails down a big part of why corrigibility (to the de facto new policy $\pi^{\text{correct}}$) is rare (for instrumental convergence reasons) and even incoherent-over-state-lotteries (if we demand that the agent be strictly corrigible to many different policies).

Thanks to NPCollapse and Justis Mills (via LW Feedback) for suggestions.


Footnote: Penalty. The AUP penalty term's optimal value functions will pretend the episode doesn't end, so that they reflect the agent's ability to move around (or not, if it's already been force-corrected to a fixed policy).