Double Corrigibility: better Corrigibility

A putative new idea for AI control; index here.

Corrigibility was an attempt to allow an AI to safely change its values, without seeking to provoke or avoid the change. The idea is that, when the AI’s utility changes from u to v at time t, it maximises a meta-utility U such that

  • U_{≤t} = u

  • U_{>t} = v + E(u|u→u) - E(v|u→v).

Here u→u designates the event that u remains unchanged, while u→v designates the change.

As has been shown, adding those expectation terms means the AI will not seek to provoke or resist such a utility change—it is indifferent to the change. Those expectation terms can be designated compensatory rewards.
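A minimal numerical sketch of that indifference, under toy assumptions: the two conditional expectations are treated as fixed numbers, utility accrued before time t is ignored, and the function name `expected_meta_utility` and the probability `p_change` are illustrative only.

```python
def expected_meta_utility(p_change, e_u_keep, e_v_change):
    """Expected value of U when the change u→v happens with probability p_change.

    e_u_keep   stands for E(u|u→u)
    e_v_change stands for E(v|u→v)
    """
    compensation = e_u_keep - e_v_change         # the compensatory reward
    value_if_keep = e_u_keep                     # no change: U is just u
    value_if_change = e_v_change + compensation  # change: U is v plus compensation
    return (1 - p_change) * value_if_keep + p_change * value_if_change


# Toy numbers (assumptions): E(u|u→u) = 10, E(v|u→v) = 4.
values = [expected_meta_utility(p, 10.0, 4.0) for p in (0.0, 0.3, 1.0)]
print(values)
```

Whatever probability is plugged in, the expected value equals E(u|u→u), so in this toy setting the AI has nothing to gain by making the change more or less likely.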

What it is not indifferent to, though, is changes to E. If the AI knows that the change u→v is coming, then it has an incentive to change E to E′ with a high E′(u|u→u) and a low E′(v|u→v), which inflates the compensatory reward. Its ability to rewire is limited by the fact that E′ will serve as its future expectation estimator, so it won’t mangle it completely, but it certainly can twist it a bit.

Notice that the AI is now no longer indifferent to the change of utility: it can benefit from rewiring E this way only if a change is coming up. Notice also that an AI running E′ probably can’t rewire itself any further, so the rewired AI will be indifferent again, but the AI before rewiring is not.
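To make the incentive concrete, here is a rough sketch with made-up numbers. It assumes the only effect of rewiring is on the size of the compensatory reward, and it ignores the cost of carrying a distorted estimator afterwards; `gain_from_rewiring` and its arguments are hypothetical names.

```python
def gain_from_rewiring(p_change, honest, twisted):
    """Extra expected compensation from swapping estimator E for E'.

    honest and twisted are (E(u|u→u), E(v|u→v)) pairs, as reported by
    the original estimator E and the twisted estimator E' respectively.
    """
    honest_comp = honest[0] - honest[1]
    twisted_comp = twisted[0] - twisted[1]
    # The compensation is only paid out if the change actually happens.
    return p_change * (twisted_comp - honest_comp)


honest = (10.0, 4.0)   # toy values under E
twisted = (11.0, 3.0)  # toy values under E': higher first term, lower second

no_change_coming = gain_from_rewiring(0.0, honest, twisted)
change_likely = gain_from_rewiring(0.9, honest, twisted)
print(no_change_coming, change_likely)
```

The gain scales with the probability of the change and vanishes when no change is expected, which is exactly why the AI's preferences now depend on whether u→v is coming.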

Double corrigibility

To solve this problem, we need to turn to corrigibility a second time. Let ρ be the AI’s current expectation/probability estimator. Let ρ′ be one that it can conceivably change into. Let C be a random variable designating the sum total of all compensatory rewards. Now, to guarantee that C is finite would require bounds on all the utility functions and some discount rate. In practice, it often will be finite. That’s because E(C) = E(u_0|u_0→u_0) - E(u_n|u_{n-1}→u_n, H), where u_0 is the AI’s first utility, u_n its last, and H the intermediate history. So some milder restrictions on the u_i should suffice to bound C.

Then, whenever the AI takes an action that changes ρ to ρ′, it gets the special meta-compensatory reward

  • E_ρ(C|ρ→ρ) - E_ρ(C|ρ→ρ′).

This prevents it from gaining any advantage in C via changes to ρ. Or, more precisely, it prevents it from gaining any expected advantage that it can figure out using its current ρ.

Note that it doesn’t prevent improvements to ρ in general, it just removes the incentives to change it to affect compensatory rewards.

Since any action a might have an indirect effect on ρ, the “utility maximising” for a given u must be changed to:

  • E_ρ(u|a) + Σ_{ρ′} P_ρ(ρ→ρ′|a) (E_ρ(C|ρ→ρ) - E_ρ(C|ρ→ρ′)),

where P_ρ is the probability estimate corresponding to ρ; the probability term can be rewritten as E_ρ(I_{ρ→ρ′}|a), for I_{ρ→ρ′} the indicator function for ρ→ρ′. In fact the whole line above can be rewritten as

  • E_ρ(u|a) + E_ρ(E_ρ(C|ρ→ρ) - E_ρ(C|ρ→ρ′) | a).

For this to work, E_ρ needs to be able to say sensible things about itself, and also about E_ρ′, which is used to estimate C if ρ→ρ′.
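The modified objective above can be sketched in a few lines. This is a toy illustration, not a proposal for how an actual agent would compute these quantities: estimators are just string labels, the table `e_c` of values E_ρ(C|ρ→ρ′) is made up, and it is assumed that an action affects C only through which estimator the AI ends up with.

```python
def meta_compensation(p_next, e_c, current="rho"):
    """Sum over rho' of P(rho→rho'|a) * (E(C|rho→rho) - E(C|rho→rho'))."""
    baseline = e_c[current]
    return sum(p * (baseline - e_c[nxt]) for nxt, p in p_next.items())


def action_score(e_u, p_next, e_c, current="rho"):
    """E(u|a) plus the expected meta-compensatory reward for action a."""
    return e_u + meta_compensation(p_next, e_c, current)


# Toy values (assumptions): the current estimate of C under each successor.
e_c = {"rho": 2.0, "rho_twisted": 5.0}

# Each action has an expected utility and a distribution over successors.
actions = {
    "leave_estimator": {"e_u": 1.0, "p_next": {"rho": 1.0}},
    "twist_estimator": {"e_u": 1.0, "p_next": {"rho": 0.2, "rho_twisted": 0.8}},
}

totals = {}
for name, a in actions.items():
    # Expected compensatory rewards under this action, judged by the current rho.
    expected_c = sum(p * e_c[nxt] for nxt, p in a["p_next"].items())
    comp = meta_compensation(a["p_next"], e_c)
    totals[name] = expected_c + comp
    print(name, expected_c, comp, totals[name])
```

In this toy setup the extra C that the twisting action expects to collect is cancelled by the meta-compensatory term, so expected C plus meta-compensation is the same for both actions (up to floating-point rounding), and the choice between them is left to E_ρ(u|a).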

If we compare this with various ways of factoring out variables, we can see that it’s a case where we have a clear default, ρ, and are estimating deviations from that.