Double Corrigibility: better Corrigibility

A putative new idea for AI control; index here.

Corrigibility was an attempt to allow an AI to safely change its values, without seeking to provoke or avoid the change. The idea is that, when the AI’s utility changes from u to v at time t, it maximises a meta-utility U such that

U_≤t = u
U_>t = v + E(u|u→u) - E(v|u→v).

Here u→u designates the event that u remains unchanged, while u→v designates the change.

As has been shown, adding those expectation terms means the AI will not seek to provoke or resist such a utility change—it is indifferent to the change. Those expectation terms can be designated compensatory rewards.
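As a quick sanity check of that indifference claim, here is a minimal sketch with made-up numbers. The distributions, the names `u_if_kept` and `v_if_changed`, and the helper functions are all hypothetical illustrations, not part of the original construction. It shows that the expected value of U is the same whether or not the change happens.

```python
from fractions import Fraction as F

def expectation(dist):
    """Expected value of a {value: probability} distribution."""
    return sum(value * prob for value, prob in dist.items())

# Hypothetical toy distributions (assumed for illustration only).
u_if_kept = {F(10): F(1, 2), F(4): F(1, 2)}    # values of u given u→u
v_if_changed = {F(3): F(1, 4), F(1): F(3, 4)}  # values of v given u→v

exp_u_kept = expectation(u_if_kept)         # E(u | u→u)
exp_v_changed = expectation(v_if_changed)   # E(v | u→v)

def meta_utility_after_change(v_realised):
    """U_>t = v + E(u|u→u) - E(v|u→v), for one realised value of v."""
    return v_realised + exp_u_kept - exp_v_changed

# Expected U if the change happens, averaging over outcomes of v.
expected_U_if_changed = sum(
    meta_utility_after_change(v) * p for v, p in v_if_changed.items()
)
# Expected U if the utility stays u.
expected_U_if_kept = exp_u_kept

print(expected_U_if_changed, expected_U_if_kept)
```

Both quantities come out equal, so under these assumptions the AI gains nothing in expectation by provoking or preventing the change.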

What it is not indifferent to, though, is changes to E. If the AI knows that change u→v is coming, then it has incentives to change E to E’ with a high E’(u|u→u) and a low E’(v|u→v). Its ability to rewire is limited by the fact that E’ will serve as its future expectation estimator, so it won’t mangle it completely, but it certainly can twist it a bit.

Notice that the AI is now no longer indifferent to the change of utility: it can benefit from rewiring E this way only if a change is coming up. Notice also that an AI using E’ probably can’t rewire its estimator any further, so the rewired AI will be indifferent to the change, but the original AI is not indifferent before the rewiring.
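The incentive can be illustrated with a small sketch. The numbers and the names `E`, `E_twisted` and `expected_gain_from_rewiring` are assumptions made for this example, and the sketch deliberately ignores the cost of having a slightly mangled estimator afterwards.

```python
def compensatory_reward(estimator):
    """E(u|u→u) - E(v|u→v), as computed by the given estimator."""
    return estimator["u_if_kept"] - estimator["v_if_changed"]

# Hypothetical estimates (assumed values, for illustration only).
E = {"u_if_kept": 7.0, "v_if_changed": 1.5}          # honest estimator
E_twisted = {"u_if_kept": 8.0, "v_if_changed": 1.0}  # slightly twisted E’

def expected_gain_from_rewiring(p_change):
    """Extra compensatory reward from using E’ instead of E,
    weighted by the probability that the change u→v happens."""
    extra = compensatory_reward(E_twisted) - compensatory_reward(E)
    return p_change * extra

for p in (0.0, 0.5, 1.0):
    print(p, expected_gain_from_rewiring(p))
```

In this toy setup the gain is zero when no change is expected and grows with the probability of the change, which is the sense in which the AI stops being indifferent.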

Double corrigibility

To solve this problem, we need to turn to corrigibility a second time. Let ρ be the AI’s current expectation/probability estimator. Let ρ′ be one that it can conceivably change into. Let C be a random variable designating the sum total of all compensatory rewards. Now, to guarantee that C is finite would require bounds on all the utility functions and some discount rate. In practice, it often will be finite. That’s because E(C)=E(u₀|u₀→u₀)-E(u_n|u_{n-1}→u_n, H), where u₀ is the AI’s first utility, u_n its last, and H the intermediate history. So some milder restrictions on the u_i should suffice to bound C.

Then, whenever the AI takes an action that changes ρ to ρ′, it gets the special meta-compensatory reward

E_ρ(C|ρ→ρ)-E_ρ(C|ρ→ρ′).

This prevents it from gaining any advantage in C through changes to ρ. Or, more precisely, it prevents it from gaining any expected advantage that it can figure out using its current ρ.
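One way to see the cancellation, written out as a worked line (a sketch that treats the meta-compensatory reward as simply added to the expected total compensation after the switch, all evaluated by the current ρ):

```latex
\begin{align*}
\underbrace{E_\rho(C \mid \rho\to\rho')}_{\text{expected } C \text{ after the switch}}
+ \underbrace{E_\rho(C \mid \rho\to\rho) - E_\rho(C \mid \rho\to\rho')}_{\text{meta-compensatory reward}}
= E_\rho(C \mid \rho\to\rho).
\end{align*}
```

So, as judged by ρ, switching to ρ′ leaves the expected total exactly where it would have been without the switch.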

Note that it doesn’t prevent improvements to ρ in general, it just removes the incentives to change it to affect compensatory rewards.

Since any action a might have an indirect effect on ρ, the “utility maximising” for a given u must be changed to:

E_ρ(u|a) + Σ_ρ′ P_ρ(ρ→ρ′|a) (E_ρ(C|ρ→ρ)-E_ρ(C|ρ→ρ′)),

where P_ρ is the probability estimate corresponding to ρ; the probability term can be rewritten as E_ρ(I_ρ→ρ′) for I_ρ→ρ′ the indicator function for ρ→ρ′. In fact the whole line above can be rewritten as

E_ρ(u|a) + E_ρ(E_ρ(C|ρ→ρ)-E_ρ(C|ρ→ρ′) | a).
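To make the adjusted objective concrete, here is a minimal sketch over a finite set of candidate estimators. Everything in it is assumed for illustration: the estimator names `rho`, `rho_a`, `rho_b`, the action names, and all numbers. It computes the sum form of the objective and checks that, once the meta-compensatory term is included, the expected C is the same for every action.

```python
from fractions import Fraction as F

# E_ρ(C | ρ→ρ′) for each candidate ρ′ (hypothetical values).
exp_C_given_switch = {"rho": F(2), "rho_a": F(5), "rho_b": F(1)}

# P_ρ(ρ→ρ′ | a) for each action a (hypothetical values).
p_switch = {
    "act_safe": {"rho": F(1)},
    "act_risky": {"rho": F(1, 2), "rho_a": F(1, 3), "rho_b": F(1, 6)},
}

# E_ρ(u | a) for each action (hypothetical values).
exp_u = {"act_safe": F(3), "act_risky": F(3)}

def meta_reward(new_rho):
    """E_ρ(C|ρ→ρ) - E_ρ(C|ρ→ρ′); zero when ρ′ = ρ."""
    return exp_C_given_switch["rho"] - exp_C_given_switch[new_rho]

def meta_term(action):
    """Σ_ρ′ P_ρ(ρ→ρ′|a) (E_ρ(C|ρ→ρ) - E_ρ(C|ρ→ρ′))."""
    return sum(p * meta_reward(r) for r, p in p_switch[action].items())

def adjusted_value(action):
    """E_ρ(u|a) plus the expected meta-compensatory reward."""
    return exp_u[action] + meta_term(action)

def expected_C(action):
    """E_ρ(C | a), averaging over which estimator the AI ends up with."""
    return sum(p * exp_C_given_switch[r] for r, p in p_switch[action].items())

for a in p_switch:
    print(a, adjusted_value(a), expected_C(a) + meta_term(a))
```

In this toy example `act_risky` raises the expected C by shifting probability towards `rho_a`, but the meta term subtracts exactly that gain, so the compensated total equals E_ρ(C|ρ→ρ) for both actions.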

For this to work, E_ρ needs to be able to say sensible things about itself, and also about E_ρ′, which is used to estimate C if ρ→ρ′.

If we compare this with various ways of factoring out variables, we can see that it’s a case where we have a clear default, ρ, and are estimating deviations from that.