Double Corrigibility: better Corrigibility

This post is obsolete, replaced by this one.

EDIT: Defining $C$ and the conditions on $E'$ is a bit more complicated than presented here; I still have some details to formalise.

Corrigibility was an attempt to allow an AI to safely change its values, without seeking to provoke or avoid the change. The idea is that, when the AI's utility changes from $u$ to $v$ at time $t$, it maximises a meta-utility $U$ such that

  • .

  • .


Here $u \to u$ designates the event that $u$ remains unchanged, while $u \to v$ designates the change.

As has been shown, adding those expectation terms means the AI will not seek to provoke or resist such a utility change: it is indifferent to the change. Those expectation terms can be designated compensatory rewards.
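To make the indifference concrete, here is a toy numerical sketch (the dictionary-based stand-in for $E$ and all the numbers are invented for illustration, not part of the original proposal):

```python
# Toy sketch of utility indifference (all numbers are illustrative).
# E is a stand-in expectation estimator: a dict of conditional expectations.
E = {
    ("u", "u->u"): 10.0,  # E(u | u remains unchanged)
    ("v", "u->v"): 3.0,   # E(v | u is replaced by v)
}

# Compensatory reward added to v when the change u->v happens.
compensation = E[("u", "u->u")] - E[("v", "u->v")]  # 10.0 - 3.0 = 7.0

# Expected meta-utility U under each event:
U_no_change = E[("u", "u->u")]              # U = u, given u->u
U_change = E[("v", "u->v")] + compensation  # U = v + compensation, given u->v

assert U_no_change == U_change  # the AI expects the same U either way
```

Since both branches have the same expected meta-utility, the AI gains nothing by pushing the change one way or the other.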

What it is not indifferent to, though, is changes to $E$, its expectation estimator. If the AI knows that the change $u \to v$ is coming, then it has incentives to change $E$ to $E'$ with a high $E'(u \mid u \to u)$ and a low $E'(v \mid u \to v)$. Its ability to rewire is limited by the fact that $E'$ will serve as its future expectation estimator, so it won't mangle it completely, but it certainly can twist it a bit.

Notice that the AI is now no longer indifferent to the change of utility. It can benefit from rewiring this way only if a change is coming up. Notice that $E'$ probably can't be rewired any further, so the rewired AI will be indifferent, but the AI was not indifferent before rewiring.
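The rewiring incentive can be seen in the same toy setting (again, the estimators and numbers are purely illustrative): a slightly twisted `E_prime` inflates the compensatory payout the AI anticipates from the coming change.

```python
# Toy sketch of the rewiring incentive (illustrative numbers).
E = {("u", "u->u"): 10.0, ("v", "u->v"): 3.0}        # honest estimator
E_prime = {("u", "u->u"): 12.0, ("v", "u->v"): 1.0}  # slightly twisted estimator

def compensation(est):
    """Compensatory reward E(u | u->u) - E(v | u->v) under estimator est."""
    return est[("u", "u->u")] - est[("v", "u->v")]

# If the change u->v is coming, swapping E for E_prime first raises the payout:
gain_from_rewiring = compensation(E_prime) - compensation(E)
assert gain_from_rewiring > 0  # 11.0 - 7.0 = 4.0 extra expected reward
```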

Double corrigibility

To solve this problem, we need to turn to corrigibility a second time. Let $E$ be the AI's current expectation/​probability estimator. Let $E'$ be one that it can conceivably change into. Let $C$ be a random variable designating the sum total of all compensatory rewards. Now, to guarantee that $C$ is finite would require bounds on all the utility functions and some discount rate. In practice, $C$ will often be finite. That's because $C = \sum_{i=1}^{n-1} \big( E(u_i \mid u_i \to u_i) - E(u_{i+1} \mid u_i \to u_{i+1}) \big)$, where $u_1$ is the AI's first utility and $u_n$ its last. So some milder restrictions on the $u_i$ should suffice to bound $C$.
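The structure of that sum can be sketched numerically (a hypothetical chain of three utilities, with made-up expectation values; when each new utility is valued the same under $u_i \to u_{i+1}$ as under $u_{i+1} \to u_{i+1}$, the sum telescopes down to the first and last terms):

```python
# Toy sketch: C as the sum of compensatory rewards over a chain of
# changes u1 -> u2 -> u3 (all expectation values are illustrative).
E_keep = [10.0, 6.0, 4.0]  # E(u_i | u_i -> u_i) for i = 1, 2, 3
E_new = [None, 6.0, 4.0]   # E(u_{i+1} | u_i -> u_{i+1}), equal to E_keep here

# Each term: E(u_i | u_i -> u_i) - E(u_{i+1} | u_i -> u_{i+1})
C = sum(E_keep[i] - E_new[i + 1] for i in range(len(E_keep) - 1))

# (10 - 6) + (6 - 4) = 6: first minus last, since the middle terms cancel.
assert C == E_keep[0] - E_keep[-1]
```

So bounding the individual utilities is enough to bound $C$ in this telescoping case, which is why only mild restrictions on the $u_i$ are needed.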

Then, whenever the AI takes an action that changes $E$ to $E'$, it gets the special meta-compensatory reward

  • $E(C \mid E \to E) - E(C \mid E \to E')$.

This prevents it from deriving any advantage to $C$ via changes to $E$. Or, more precisely, it prevents it from deriving any expected advantage that it can figure out using its current $E$.
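In the same toy setting as above (illustrative numbers only), the meta-compensatory reward exactly cancels the anticipated gain from swapping estimators, as judged by the current $E$:

```python
# Toy sketch of the meta-compensatory reward (illustrative numbers).
# Expected total compensatory rewards C, as estimated by the CURRENT E:
E_C_given_keep = 7.0   # E(C | E -> E):  keep the current estimator
E_C_given_swap = 11.0  # E(C | E -> E'): swap to a twisted estimator

# Reward granted when the AI performs the change E -> E':
meta_compensation = E_C_given_keep - E_C_given_swap  # -4.0, a penalty here

# Net expected gain from rewiring, by the AI's own current lights:
net_gain = (E_C_given_swap + meta_compensation) - E_C_given_keep
assert net_gain == 0.0  # rewiring no longer looks profitable
```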

Note that it doesn’t prevent improvements to $E$ in general; it just removes the incentives to change it in order to affect compensatory rewards.

Since any action might have an indirect effect on $E$, the “utility maximising” for a given $u$ must be changed to:

  • ,

where $P$ is the probability estimate corresponding to $E$; the probability term can be rewritten as $E(I_{E \to E'} \mid a)$ for $I_{E \to E'}$ the indicator function for $E \to E'$. In fact the whole line above can be rewritten as

  • .

For this to work, $E$ needs to be able to say sensible things about itself, and also about $E'$, which is used to estimate $C$ if $E \to E'$.

If we compare this with various ways of factoring out variables, we can see that it’s a case where we have a clear default, $E \to E$, and are estimating deviations from that.