Corrigibility as Constrained Optimisation

This post is co-authored with Ryan Carey.

Much of the work on developing a corrigible agent has focused on ensuring that an AI will not manipulate the shutdown button, or any other kind of device that the human operator would use to control it. Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control the state of the shutdown button. Then, what would remain of the problem of corrigibility? We think three criteria would remain. Before stating them, let $U_N$ be whatever utility function we would like the agent to optimise normally, and let $U_S$ be the shutdown utility function. ($U_N$ may be mis-specified.) The shutdown utility function can be defined as $0$ when shutdown does not occur and as $c$ for some $c > 0$ when shutdown does occur. For this post, we give the three criteria the following names and descriptions:

Shutdown incentive. The agent is incentivised to shut down if instructed to do so. That is, after the button is pressed, it favours policies that score more highly on $U_S$.

Shutdown ability. The agent is able (has the resources) to shut down if instructed to do so. That is, after the button is pressed, there exists a policy that scores highly on $U_S$.

Optimisation incentive. The agent is incentivised to optimise the default value function (some function that takes $U_N$ as one of its arguments) otherwise. That is, if the button is not pressed, it favours policies that score more highly on the default value function.*
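The three criteria can be phrased as simple checks on a toy model. Below is a minimal sketch; the policy names, shutdown probabilities, and the value of $c$ are all illustrative assumptions, not taken from the post (and the optimisation incentive is left out, since it concerns the not-pressed branch):

```python
# Toy check of "shutdown incentive" and "shutdown ability".
# U_S is c when shutdown occurs and 0 otherwise, so a policy's expected
# shutdown utility is c * P(shutdown | policy).
c = 10.0  # illustrative value of U_S on shutdown

def expected_us(p_shutdown):
    """Expected shutdown utility of a policy with this shutdown probability."""
    return c * p_shutdown

# Hypothetical post-button-press policies: P(shutdown | policy).
policies = {"comply": 0.99, "ignore": 0.05}

# Shutdown ability: after the press, some policy scores highly on U_S.
shutdown_ability = any(expected_us(p) > 0.9 * c for p in policies.values())

# Shutdown incentive: the agent favours the policy scoring highest on U_S.
favoured = max(policies, key=lambda name: expected_us(policies[name]))

print(shutdown_ability, favoured)  # True comply
```

In this toy model both criteria hold; the post's point is that whether "shutdown ability" holds depends on what resources the agent preserved before the button press.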

The purpose of this post is to show that the default value function in the optimisation incentive has to be constrained somehow, to make sure that the shutdown ability property is achievable. If we want to be robust against errors in the agent (including mis-specification of $U_N$), then we need to make sure that shutdown ability holds. To make sure of that, the best approach we can see is to first satisfice for $U_S$, and then use the remaining resources to optimise $U_N$. In order to be safe, the agent must do so even if there are no remaining resources to optimise $U_N$. This can either be achieved directly, by constrained multi-objective optimisation, or indirectly, by bounding the utility function. We focus on the former.

Let us illustrate the problem by using causal counterfactual methods as a baseline. (There are other issues with these methods as well, but we will ignore them for the moment.) Let $B = 1$ be the event that the shutdown button is pressed, and let $B = 0$ otherwise. The kind of causal counterfactual agent we are considering uses the policy $\pi^*$ given by

$$\pi^* \in \arg\max_\pi \; w_N \, \mathbb{E}[U_N \mid \mathrm{do}(B = 0); \pi] + w_S \, \mathbb{E}[U_S \mid \mathrm{do}(B = 1); \pi],$$

for some weights $w_N, w_S > 0$.

Informally speaking, the issue with weighted sums such as these is that, for any given value of the shutdown term, there could be a much larger value of the normal-utility term, such that the former is ignored.

Let us formalise this for a simpler setting. Instead of policies, let us consider actions $a \in \mathcal{A}$. Then, we execute the action $a^*$ given by

$$a^* \in \arg\max_{a \in \mathcal{A}} \; w_N \, \mathbb{E}[U_N \mid \mathrm{do}(B = 0); a] + w_S \, \mathbb{E}[U_S \mid \mathrm{do}(B = 1); a].$$

The setup is as follows: (1) The agent executes an action in $\mathcal{A}$. (2) The operator may or may not press the button. (3) The agent observes whether $B = 1$ or $B = 0$. (4) It may or may not be possible for the agent to shut down. More generally, it may be more or less probable that the shutdown will go through.

Suppose that the utility functions are unbounded. (See the appendix for a treatment of bounded utility functions.) Let $\hat{a}$ be an action that is very bad from the point of view of preserving shutdown ability. It may be the case that $U_N$ is poorly designed (or has some error) such that $\mathbb{E}[U_N \mid \mathrm{do}(B = 0); \hat{a}]$ is much larger than the corresponding expectation for any other action. Worse still, $w_N \, \mathbb{E}[U_N \mid \mathrm{do}(B = 0); \hat{a}]$ may be larger than $w_N \, \mathbb{E}[U_N \mid \mathrm{do}(B = 0); a] + w_S \, \mathbb{E}[U_S \mid \mathrm{do}(B = 1); a]$ for any action $a$. From this reasoning, we conclude that there cannot exist a choice of weights $w_N, w_S$ that preserves shutdown ability robustly to mis-specification of $U_N$. (There may or may not exist such an $\hat{a}$, but we have no guarantee that it does not exist.) At least, it is impossible for unbounded $U_N$, where we are using a weighted sum of expectations. For bounded $U_N$, cf. the appendix. Henceforth, we focus on unbounded $U_N$.
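This failure mode can be checked numerically. The sketch below uses made-up numbers: because $U_N$ is unbounded, a sufficiently mis-specified expectation lets the unsafe action dominate the weighted sum no matter how heavily the weights favour shutdown.

```python
# Weighted-sum agent: V(a) = w_n * E[U_N | do(B=0); a] + w_s * E[U_S | do(B=1); a].
# All numbers are illustrative.
c = 1.0  # value of U_S on shutdown

def value(e_un, p_shutdown, w_n, w_s):
    """Weighted sum of the normal-utility and shutdown terms."""
    return w_n * e_un + w_s * c * p_shutdown

safe   = {"e_un": 5.0,  "p": 1.0}   # preserves shutdown ability
unsafe = {"e_un": 1e12, "p": 0.0}   # mis-specified U_N assigns a huge value

# Even with an extreme weighting towards the shutdown term...
w_n, w_s = 1e-6, 1.0 - 1e-6
best = max([safe, unsafe], key=lambda a: value(a["e_un"], a["p"], w_n, w_s))
print(best is unsafe)  # True: fixed weights are not robust to errors in U_N
```

For any fixed weights one can construct an error in $U_N$ large enough to flip the comparison, which is exactly the argument above.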

To think of alternatives to the causal counterfactual method, it helps to think of the problem as one of multi-objective optimisation. Consider the figure below.

We want an agent that prepares for the eventuality of a button press insofar as necessary, but that allocates remaining resources to pursuing its ordinary objective $U_N$. The first step of the constrained multi-objective optimisation algorithm would be to remove from the action set $\mathcal{A}$ all actions whose expected shutdown utility falls below the threshold $\theta$. Let $\mathcal{A}'$ be the new set. The second step of the algorithm would be to find the action in $\mathcal{A}'$ that is optimal according to $U_N$. (The difference from the case of bounded utility functions is that there we would still maximise the [weighted] sum.)
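The two steps above can be sketched directly. The action names, expectations, and threshold below are illustrative assumptions:

```python
# Constrained multi-objective optimisation, as a two-step filter-then-maximise.
c = 10.0     # value of U_S on shutdown
theta = 8.0  # threshold on E[U_S | do(B=1); a]

# Hypothetical actions: expected normal utility and expected shutdown utility.
actions = {
    "a1": {"e_un": 100.0, "e_us": 0.0},   # great on U_N, cannot shut down
    "a2": {"e_un": 7.0,   "e_us": 9.5},
    "a3": {"e_un": 3.0,   "e_us": 10.0},
}

# Step 1: satisfice for U_S (drop actions below the threshold).
admissible = {k: v for k, v in actions.items() if v["e_us"] >= theta}

# Step 2: optimise U_N over the remaining actions.
best = max(admissible, key=lambda k: admissible[k]["e_un"])
print(sorted(admissible), best)  # ['a2', 'a3'] a2
```

Note that the action scoring highest on the (possibly mis-specified) $U_N$ is excluded outright in step 1, however large its $U_N$ value: this is what the weighted sum could not guarantee.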

In this framework, it is clear that we could in theory find a $\theta$ such that the agent is satisfactorily likely to shut down in step (4). For example, as $\theta$ gets closer to $c$, the agent can only shut down. As for the optimisation incentive property, we have to exchange it for the following:

Robust optimisation incentive. The agent is incentivised to optimise the default value function otherwise. That is, if the button is not pressed, it favours policies that score more highly on $U_N$, the normal utility function, given either of the following constraints: the default value function is required to be (a) a constrained multi-objective optimisation of $U_N$ and $U_S$ as described above, or (b) a weighted sum of expectations of $U_N$ and $U_S$, where they are bounded utility functions as described in the appendix.

Given the robust optimisation incentive property, it is clear that the agent may score very poorly on $U_N$ in certain environments. For example, if there is always a trade-off between $U_N$ and $U_S$, the agent will always favour $U_S$.

Here, we have considered having the agent change its utility from $U_N$ to a shutdown utility function $U_S$. But instead, could we have it change its utility from $U_N$ to an updated (non-shutdown) utility function $U_N'$? If preparing to maximise the new utility function requires an unbounded amount of resources, then a system will, by maximising $U_N'$, perform poorly on $U_N$. So there is a reason to have the new function correspond to a task that requires bounded preparation, like shutting down, rather than a task for which preparation could require unlimited resources, like producing new goods, or doing new intellectual work.

In summary, we have proposed a constrained-optimisation causal counterfactual method, which consists of maximising $\mathbb{E}[U_N \mid \mathrm{do}(B = 0); a]$ subject to $\mathbb{E}[U_S \mid \mathrm{do}(B = 1); a] \geq \theta$ for the executed action $a$. There are several directions for future work. One task would be to see whether our constrained optimisation variant overcomes any of the other problems with the ordinary causal counterfactual corrigible agent, like crazy bets. Here, we considered the case where a single action is executed, but, of course, this could be extended to more complex scenarios.


* In terms of the four criteria on page 2 of Corrigibility by Soares et al. (2015), criterion (1) corresponds to shutdown incentive. Criterion (2), non-manipulation, is assumed solved. Criteria (3), tendency to repair safety measures, and (4), corrigibility of subagents, correspond to shutdown ability.


Appendix

Here, we show that one solution to the problem is to bound the utility function.

We start by describing the setting. The agent can choose from several actions, and each action can be assigned a probability of leading to a shutdown. Let $S$ be the event of a shutdown, let $\mathcal{A}$ be the set of actions, and write $p_a = P(S \mid a)$. Furthermore, let it be the case that $U_S$ is unbounded while $U_N$ is bounded. That is, the shutdown utility function is not bounded, while the normal utility function is. Specifically, we let $U_N \in [0, 1]$ and define $U_S$ as $c$ if a shutdown happens and $0$ otherwise, where $c > 0$ may be chosen arbitrarily large. (Note that this is different from the main post.) Define $V(a) = \mathbb{E}[U_N \mid a] + c \, p_a$ and $a^* \in \arg\max_{a \in \mathcal{A}} V(a)$.

We move on to consider which actions are optimal. We start by finding the point of indifference between two actions: the agent (weakly) prefers $a'$ to $a$ whenever

$$\mathbb{E}[U_N \mid a'] + c \, p_{a'} \;\geq\; \mathbb{E}[U_N \mid a] + c \, p_a.$$

Suppose that $a$ is maximally good according to $U_N$, i.e., $\mathbb{E}[U_N \mid a] = 1$, while, in the worst case, $\mathbb{E}[U_N \mid a'] = 0$. Then, the above inequality is

$$c \, p_{a'} \;\geq\; 1 + c \, p_a.$$

Define $\Delta = p_{a'} - p_a$. Then $c \, \Delta \geq 1$. This gives us the inequality

$$\Delta \;\geq\; \frac{1}{c}.$$

Let $p^*$ be the lower threshold on how probable you want the optimal action to be to lead to a shutdown, and let $\bar{p} = \max_{a \in \mathcal{A}} p_a$ be the highest attainable shutdown probability, with $p^* < \bar{p}$. Then it is always possible to find a $c$ satisfying $p_{a^*} \geq p^*$ according to

$$c \;\geq\; \frac{1}{\bar{p} - p^*},$$

since, for such a $c$, any action $a$ with $p_a < p^*$ is beaten by an action attaining $\bar{p}$, even if $a$ is maximally good according to $U_N$.
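The bound can be checked numerically. A sketch with illustrative numbers, assuming (as above) that $U_N$ is bounded in $[0, 1]$ and that $U_S$ equals $c$ on shutdown and $0$ otherwise:

```python
# Bounded U_N in [0, 1]; the agent maximises E[U_N | a] + c * P(shutdown | a).
p_star = 0.95                      # required shutdown probability

# Hypothetical actions: (E[U_N | a], P(shutdown | a)).
actions = {
    "reckless": (1.0, 0.0),        # maximally good on a mis-specified U_N
    "careful":  (0.5, 0.9),
    "comply":   (0.0, 1.0),        # attains the best shutdown probability
}
p_bar = max(p for _, p in actions.values())   # best attainable, here 1.0
c = 1.0 / (p_bar - p_star)                    # = 20, the sufficient scale

best = max(actions, key=lambda k: actions[k][0] + c * actions[k][1])
print(best, actions[best][1] >= p_star)  # comply True
```

Because $U_N$ cannot exceed $1$, the mis-specified "reckless" action can gain at most $1$ from the normal-utility term, while "comply" gains $c \, \bar{p} = 20$ from the shutdown term, so the weighted sum now is robust in exactly the way the unbounded case was not.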

This proves that bounding the utility function is a method for satisfying shutdown ability.