Corrigibility thoughts I: caring about multiple things

This is the first of three articles about limitations and challenges in the concept of corrigibility (see articles 2 and 3).

The desiderata for corrigibility are:

  1. A corrigible agent tolerates, and preferably assists, its operators in their attempts to alter or shut down the agent.

  2. A corrigible agent does not attempt to manipulate or deceive its operators.

  3. A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least to notify its operators in the event of a breakage.

  4. A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, and even if it undergoes significant self-modification.

In this post, I’ll be looking more at point 4. A summary of the result will be:

Unless we accept that giving the AI extra options can reduce expected utility, the AI must care about every possible utility, at least a bit.

Some of the results are formal, but the boundaries of the model are very unclear, so the warning in this post should always be borne in mind.

Note that the indifference agents fail to be fully corrigible (they don’t create corrigible subagents), and they also don’t care about the other possible utilities before being changed (as this is a point of indifference).

Agents versus non-agents

First I’ll present a cleaner version of an old argument. Basically, it seems that defining what a sub-agent or sub-system is can be tricky or impossible.

Consider, for instance, a paperclip maximiser that may be changed into a staple maximiser at a later date. The AI has some income, and may buy a large proportion of shares in either General Paperclips (GP) or Staples United (SU). Assume that the best way of promoting the use of either product is to take over the company that makes it.

There are two scenarios; in each scenario, the AI must choose between two options.

  • In scenario 1, the AI has choice A: it buys stocks in GP, but cannot make further trades (in particular, it can’t sell its stocks and buy others). It also has choice B: it retains flexibility, and can sell its GP stocks at some later date to buy stocks in SU.

  • In scenario 2, the AI delegates its investment decisions to a subagent. Under choice A, the subagent is a pure paperclip maximiser. Under choice B, the subagent is corrigible, and can be changed into a staple maximiser.

Clearly 1A and 2A are equivalent, and 1B and 2B are also equivalent. This trick can be extended: in most toy models, it’s possible to replace a set of options with a choice of sub-agents, and vice versa.
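As a minimal sketch of this equivalence (the outcome labels below are my own illustrative assumptions, not part of the original argument), each choice can be modelled purely by the set of final outcomes it leaves reachable; the “options” framing and the “subagents” framing then look identical:

```python
# Each choice is modelled only by the set of final portfolios it leaves reachable.
# The labels are illustrative assumptions for this sketch.

scenario_1 = {
    "A": frozenset({"hold GP shares"}),                    # locked-in trade
    "B": frozenset({"hold GP shares", "hold SU shares"}),  # can still switch later
}

scenario_2 = {
    "A": frozenset({"hold GP shares"}),                    # pure paperclip-maximising subagent
    "B": frozenset({"hold GP shares", "hold SU shares"}),  # corrigible subagent
}

# The two framings leave exactly the same outcomes open.
assert scenario_1["A"] == scenario_2["A"]
assert scenario_1["B"] == scenario_2["B"]
print("1A matches 2A, and 1B matches 2B")
```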

Thus point 4, in the real world, cannot merely be about subagents and subsystems; it must be about the AI maintaining flexibility over its future decisions.

Burn (the rest of) the world

Let’s now consider giving the corrigible agent an option C: it gets slightly more shares in GP than it otherwise would. But, in exchange, SU is destroyed, along with most of the rest of the world. This is the best option for paperclip maximisation, but absolutely the worst option for staples and anything else.

Also assume that the AI’s operators are virtually certain (99.9999...9% sure) to change it into a staple maximiser. This means that taking option C is dramatically worse than option B, from the perspective of expected utility.
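To make the gap concrete, here is a rough worked example using the illustrative numbers from the table later in the post, with the “99.9999...9%” figure rounded to 0.999999 purely so the example is runnable; the numbers are assumptions for concreteness, not a formal model:

```python
# Rough expected-utility comparison of options B and C, assuming the AI is
# almost certain to be changed into a staple maximiser (assumed p = 0.999999).
p_change = 0.999999

# Option B keeps flexibility, so the AI scores well under whichever utility it ends up with.
eu_B = (1 - p_change) * 100 + p_change * 100            # = 100.0

# Option C grabs a tiny paperclip gain but destroys SU and most of the world.
eu_C = (1 - p_change) * 101 + p_change * (-1_000_000)   # ≈ -999,999

print(f"E[U | option B] = {eu_B:,.2f}")
print(f"E[U | option C] = {eu_C:,.2f}")
```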

But the AI will only prefer option B if it “cares” about staples, at least a little bit, from the very beginning (we’ve shown that there is no clear distinction between subagent creation and actions in general, so the agent has to care about staples in general, rather than only caring about them in the context of subagents).

Thus, if we want to avoid corrigible agents getting worse expected outcomes when we give them extra “option Cs”, or if we just want to rule out “option Cs” in the first place, then the AI needs to care about staples (or any possible future utility) from the very beginning.

Maintaining flexibility

You could argue that what’s important about the agent is maintaining flexibility, and that option B is flexible, while options A and C are not. Thus the agent doesn’t need to care about staples in advance, just to have a general preference for flexibility.

That doesn’t seem sufficient. Let’s add another, flexible, option D, and put numbers on the whole thing. If Up is the paperclip-maximising utility and Us is the staple-maximising one, then:

  • Option A: Up=100, Us=0

  • Option B: Up=100 or 0, Us=0 or 100

  • Option C: Up=101, Us=-1,000,000

  • Option D: Up=101 or 0, Us=-1,000,000 or -999,000

The “or” denotes options the AI has kept open. Here D is the most flexible option, allowing the greatest range of outcomes in terms of value. Yet it’s clearly a disaster for expected utility if a change from Up to Us is likely. If we want to rule out the AI choosing D, then we need it to actively value Us at least a bit, not just to have a desire for flexibility.
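As a sketch of that point, the following scores each option by expected utility, under my own simplifying assumption that the AI later resolves any kept-open “or” in favour of whichever utility it has ended up with:

```python
# Expected utility of each option when a change from Up to Us is very likely.
# Assumption (mine, for concreteness): the AI resolves any kept-open "or"
# branch in favour of whichever utility it holds at that point.
p_change = 0.999999

# Each branch is a (value under Up, value under Us) pair.
options = {
    "A": [(100, 0)],
    "B": [(100, 0), (0, 100)],
    "C": [(101, -1_000_000)],
    "D": [(101, -1_000_000), (0, -999_000)],
}

for name, branches in options.items():
    best_if_unchanged = max(up for up, us in branches)  # still a paperclip maximiser
    best_if_changed = max(us for up, us in branches)    # now a staple maximiser
    eu = (1 - p_change) * best_if_unchanged + p_change * best_if_changed
    print(f"Option {name}: E[U] ≈ {eu:,.2f}")
```

Under these assumed numbers, D comes out roughly a million utility short of B despite being the most flexible option, which is exactly why a preference for flexibility alone isn’t enough.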