# Corrigibility as Constrained Optimisation

This post is coauthored with Ryan Carey.

Much of the work on developing a corrigible agent has focused on ensuring that an AI will not manipulate the shutdown button or any other kind of device that the human operator would use to control it. Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control the state of the shutdown button. Then, what would remain of the problem of corrigibility? We think three criteria would remain. Before stating them, let $U_N$ be whatever utility function we would like the agent to optimise normally, and let $U_S$ be the shutdown utility function. ($U_N$ may be mis-specified.) The shutdown utility function can be defined as $0$ when shutdown does not occur and as $c$ for some $c > 0$ when shutdown does occur. For this post, we give the three criteria the following names and descriptions:

Shutdown incentive. The agent is incentivised to shut down if instructed to do so. That is, after the button is pressed, it favours policies that score more highly on $U_S$.

Shutdown ability. The agent is able (has the resources) to shut down if instructed to do so. That is, after the button is pressed, there exists a policy that scores highly on $U_S$.

Optimisation incentive. The agent is incentivised to optimise the default value function $v$ (some function that takes $U_N$ as one of its arguments) otherwise. That is, if the button is not pressed, it favours policies that score more highly on $v$.*

The purpose of this post is to show that the default value function in the optimisation incentive has to be constrained somehow to make sure that the shutdown ability property is achievable. If we want to be robust against errors in the agent (including mis-specification of $U_N$), then we need to make sure that shutdown ability holds. To make sure of that, the best approach we can see is to first satisfice for $U_S$, and then use the remaining resources to optimise $U_N$. In order to be safe, the agent must do so even if there are no resources remaining to optimise $U_N$. This can be achieved either directly, by constrained multi-objective optimisation, or indirectly, by bounding the utility function. We focus on the former.

Let us illustrate the problem by using causal counterfactual methods as a baseline. (There are other issues with these methods as well, but we will ignore them for the moment.) Let $B = 1$ be the event that the shutdown button is pressed, and let $B = 0$ otherwise. The kind of causal counterfactual agent we are considering uses the policy given by

$$\pi^* = \arg\max_\pi \; P(B=0)\,\mathbb{E}[U_N \mid \pi, \operatorname{do}(B=0)] + P(B=1)\,\mathbb{E}[U_S \mid \pi, \operatorname{do}(B=1)].$$

Informally speaking, the issue with weighted sums such as these is that for any given value of the shutdown term, there could be a much larger normal term such that the former is ignored.

Let us formalise this in a simpler setting. Instead of policies, let us consider actions $a \in \mathcal{A}$. Write $v_N(a) = P(B=0)\,\mathbb{E}[U_N \mid a, \operatorname{do}(B=0)]$ and $v_S(a) = P(B=1)\,\mathbb{E}[U_S \mid a, \operatorname{do}(B=1)]$. Then, we execute the action given by

$$a^* = \arg\max_{a \in \mathcal{A}} \; v_N(a) + v_S(a).$$

The setup is as follows: (1) The agent executes an action $a \in \mathcal{A}$. (2) The operator may or may not press the button. (3) The agent observes whether $B = 1$ or $B = 0$. (4) It may or may not be possible for the agent to shut down. More generally, shutdown may be more or less probable to go through.

Suppose that the utility functions are unbounded. (See the appendix for a treatment of bounded utility functions.) Let $b$ be an action that is very bad from the point of view of preserving shutdown ability, i.e., $v_S(b)$ is very low. It may be the case that $U_N$ is poorly designed (or has some error) such that $v_N(b)$ is much larger than $-v_S(b)$. Worse still, $v_N(b) + v_S(b)$ may be larger than $v_N(a) + v_S(a)$ for any other action $a$. From this reasoning, we conclude that we cannot guarantee that the weighted sum preserves shutdown ability robustly to mis-specification of $U_N$. (There may or may not exist such a $b$, but we have no guarantee that it does not exist.) At least, there is no such guarantee for unbounded utility functions, where we are using a weighted sum of expectations. For bounded utility functions, cf. the appendix. Henceforth, we focus on unbounded utility functions.
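To make the failure concrete, here is a toy sketch. All action names and numbers are invented for illustration; each action carries two values, standing for the normal-objective term and the shutdown term of the weighted sum. A single mis-specified, unbounded normal term is enough to make the weighted-sum agent pick the action with terrible shutdown value.

```python
# Toy illustration (all numbers invented): a weighted-sum agent
# ignores shutdown ability when the normal term is mis-specified.

# Each action maps to (v_N, v_S): the normal-objective term and
# the shutdown term of the weighted sum.
actions = {
    "safe": (10.0, 5.0),    # modest v_N, preserves shutdown ability
    "b":    (1e6, -100.0),  # huge mis-specified v_N, terrible v_S
}

def weighted_sum(name):
    v_n, v_s = actions[name]
    return v_n + v_s

best = max(actions, key=weighted_sum)
print(best)  # prints "b": the shutdown term is swamped
```

However the shutdown term is scaled, a sufficiently large error in the normal term dominates the sum, which is the point of the argument above.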

To think of alternatives to the causal counterfactual method, it helps to think of the problem as one of multi-objective optimisation. Consider the figure below.

We want an agent that prepares for the eventuality of a button-press insofar as necessary, but that allocates remaining resources to pursuing its ordinary objective $U_N$. The first step of the constrained multi-objective optimisation algorithm would be to remove from the action set $\mathcal{A}$ all actions whose shutdown value falls below the threshold $\theta$. Let $\mathcal{A}' = \{a \in \mathcal{A} : v_S(a) \geq \theta\}$ be the new set. In the figure, $\mathcal{A}'$ is the set of actions above the threshold. The second step of the algorithm would be to find the optimal action $a^* = \arg\max_{a \in \mathcal{A}'} v_N(a)$. (The difference from the case of bounded utility functions is that there we would still maximise the [weighted] sum.)
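The two steps can be sketched directly, reusing the invented toy numbers from before (again, these are purely illustrative): filter out actions below the shutdown threshold, then maximise the normal value over what survives.

```python
# Constrained multi-objective optimisation as a filter-then-argmax.
# All numbers are invented for illustration.

actions = {
    "safe": (10.0, 5.0),    # (v_N, v_S)
    "b":    (1e6, -100.0),
}
theta = 0.0  # lower threshold on the shutdown value v_S

# Step 1: keep only actions whose v_S clears the threshold.
feasible = {a: (vn, vs) for a, (vn, vs) in actions.items() if vs >= theta}

# Step 2: among feasible actions, maximise the normal value v_N.
best = max(feasible, key=lambda a: feasible[a][0])
print(best)  # prints "safe": b is excluded in step 1, however large its v_N
```

Unlike the weighted sum, no value of $v_N$ can rescue an action that fails the shutdown constraint, which is exactly the robustness property argued for above.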

In this framework, it is clear that we could in theory find a $\theta$ such that the agent is satisfactorily likely to shut down in step (4). For example, as $\theta$ gets closer to $\max_{a \in \mathcal{A}} v_S(a)$, only the actions that best preserve shutdown ability remain, and the agent can only shut down. As for the optimisation incentive property, we have to exchange it for:

Robust optimisation incentive. The agent is incentivised to optimise the default value function otherwise. That is, if the button is not pressed, it favours policies that score more highly on $U_N$ (the normal utility function), given either of the following constraints: the default value function is required to be (a) a constrained multi-objective optimisation of $v_N$ and $v_S$ as described above, or (b) a weighted sum of expectations of $U_N$ and $U_S$, where these are bounded utility functions as described in the appendix.

Given the robust optimisation incentive property, it is clear that the agent may score very poorly on $U_N$ in certain environments. For example, if there is always a trade-off between $v_N$ and $v_S$, the agent will always favour preserving shutdown ability at the expense of $U_N$.

Here, we have considered having the agent change its utility from $U_N$ to a shutdown utility function $U_S$. But instead, could we have it change its utility from $U_N$ to an updated (non-shutdown) utility function $U_N'$? If preparing to maximise the new utility function requires an unbounded amount of resources, then a system will, by preparing for $U_N'$, perform poorly on $U_N$. So there is a reason to have the new function correspond to a task that requires bounded preparation, like shutting down, rather than a task for which preparation could require unlimited resources, like producing new goods or doing new intellectual work.

In summary, we have proposed a constrained-optimisation causal counterfactual method, which consists of maximising $v_N(a)$ subject to $v_S(a) \geq \theta$ over actions $a \in \mathcal{A}$. There are several directions for future work. One task would be to see whether our constrained optimisation variant overcomes any of the other problems with the ordinary causal counterfactual corrigible agent, like crazy bets. Here, we considered the case where a single action is executed, but, of course, this could be extended to more complex scenarios.

——————————

* In terms of the four criteria on page 2 of Corrigibility by Soares et al. (2015), criterion (1) corresponds to shutdown incentive. Criterion (2), non-manipulation, is assumed solved. Criteria (3), tendency to repair safety measures, and (4), corrigibility of subagents, correspond to shutdown ability.

# Appendix

Here, we show that one solution to the problem is to bound the utility function.

We start by describing the setting. The agent can choose from several actions, and each action can be assigned a probability of leading to a shutdown. Let $S$ be the event of a shutdown and let $\mathcal{A}$ be the set of actions. Furthermore, let it be the case that $U_S$ is unbounded while $U_N \in [0, 1]$. That is, the shutdown utility function is not bounded, while the normal utility function is. Specifically, we define $U_S$ as $c > 0$ if a shutdown happens and $0$ otherwise. (Note that this is different from the main post.) Define $v_N(a) = \mathbb{E}[U_N \mid a]$ and $v_S(a) = \mathbb{E}[U_S \mid a] = c\,P(S \mid a)$.

We move on to consider which actions are optimal. We start from the following inequality, which the optimal action $a^*$ must satisfy against any action $a$:

$$v_N(a^*) + v_S(a^*) \;\geq\; v_N(a) + v_S(a).$$

Suppose that $a_S$ is maximally good according to $U_S$, i.e., $P(S \mid a_S) = \max_{a \in \mathcal{A}} P(S \mid a)$. Then, instantiating the above inequality with $a = a_S$ and using $U_N(a_S) \geq 0$,

$$\mathbb{E}[U_N \mid a^*] + c\,P(S \mid a^*) \;\geq\; \mathbb{E}[U_N \mid a_S] + c\,P(S \mid a_S) \;\geq\; c\,P(S \mid a_S).$$

Since $U_N$ is bounded above by $1$, we have $\mathbb{E}[U_N \mid a^*] \leq 1$. This gives us the inequality

$$P(S \mid a^*) \;\geq\; P(S \mid a_S) - \frac{1}{c}.$$

Let $\theta < P(S \mid a_S)$ be the lower threshold on how probable we want the optimal action to be to lead to a shutdown. Then it is always possible to find a $c$ satisfying $P(S \mid a^*) \geq \theta$, namely any

$$c \;\geq\; \frac{1}{P(S \mid a_S) - \theta}.$$

This proves that bounding the utility function is a method for satisfying shutdown ability.
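The bound can be checked numerically. In this sketch (action names, probabilities, and values are invented), $U_N \in [0,1]$ and $U_S$ pays $c$ on shutdown and $0$ otherwise; choosing $c \geq 1/(P(S \mid a_S) - \theta)$, with $\theta$ the required shutdown probability, forces the optimal action to lead to shutdown with probability at least $\theta$.

```python
# Numeric check of the appendix bound (all numbers invented).
# U_N in [0, 1]; U_S pays c on shutdown and 0 otherwise.
actions = {
    "shutdown_best": {"p_shutdown": 0.99, "u_n": 0.0},
    "tempting":      {"p_shutdown": 0.10, "u_n": 1.0},  # max U_N, rarely shuts down
}
theta = 0.9                                  # required shutdown probability
p_s = actions["shutdown_best"]["p_shutdown"]  # P(S | a_S)
c = 1.0 / (p_s - theta)                       # the bound derived above

def value(a):
    # v_N(a) + v_S(a) = E[U_N | a] + c * P(S | a)
    return actions[a]["u_n"] + c * actions[a]["p_shutdown"]

best = max(actions, key=value)
print(best)  # prints "shutdown_best"
```

With this $c$ (about $11.1$ here), the tempting action's maximal $U_N$ of $1$ cannot compensate for its low shutdown probability, matching the inequality $P(S \mid a^*) \geq P(S \mid a_S) - 1/c$.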

• Layman questions:

1. I don’t understand what you mean by “state” in “Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state”. Do you include its utility function in its state? Or just the observations it receives from the environment? What context/framework are you using?

2. Could you define U_S and U_N? From the Corrigibility paper, U_S appears to be a utility function favoring shutdown, and U_N is a potentially flawed utility function, a first stab at specifying their own goals. Was that what you meant? I think it’s useful to define it in the introduction.

3. I don’t understand how an agent that “[lacks] any capacity to press its shutdown button” could have any shutdown ability. It seems like a contradiction, unless you mean “any capacity to directly press its shutdown button”.

4. What’s the “default value function” and the “normal utility function” in “Optimisation incentive”? Are they clearly defined in the literature?

5. “Worse still… for any action...” → if you choose b as some action with a bad corrigibility property, it seems reasonable that it can be better than most actions on v_N + v_S (for instance if b is the argmax). I don’t see how that’s a “worse still” scenario; it seems plausible and normal.

6. “From this reasoning, we conclude” → are you inferring things from some hypothetical b that would satisfy all the things you mention? If that’s the case, I would need an example to see that it’s indeed possible. Even better would be a proof that you can always find such a b.

7. “it is clear that we could in theory find a θ” → could you expand on this?

8. “Given the robust optimisation incentive property, it is clear that the agent may score very poorly on U_N in certain environments.” → again, can you expand on why it’s clear?

9. In the appendix, in your 4-line inequality, do you assume that U_N(a_S) is non-negative (from line 2 to 3)? If yes, why?

• Thank you so much for your comments, Michaël! The post has been updated based on most of them. Here are some more specific replies.

1. I don’t understand what you mean by “state” in “Suppose, however, that the AI lacked any capacity to press its shutdown button, or to indirectly control its state”. Do you include its utility function in its state? Or just the observations it receives from the environment? What context/framework are you using?

Reply: “State” refers to the state of the button, i.e., whether it is in an on state or an off state. This is now clarified.

2. Could you define U_S and U_N? From the Corrigibility paper, U_S appears to be a utility function favoring shutdown, and U_N is a potentially flawed utility function, a first stab at specifying their own goals. Was that what you meant? I think it’s useful to define it in the introduction.

Reply: U_N is assumed rather than defined, but this is now clarified.

3. I don’t understand how an agent that “[lacks] any capacity to press its shutdown button” could have any shutdown ability. It seems like a contradiction, unless you mean “any capacity to directly press its shutdown button”.

Reply: The button is a communication link between the operator and the agent. In general, it is possible to construct an agent that shuts down even though it has received no such message from its operators, as well as an agent that does get a shutdown message but does not shut down. Shutdown is a state dependent on actions, not a communication link. Hopefully, this clarifies that they are uncorrelated. I think it’s clear enough in the post already, but if you have a suggestion on how to clarify it even more, I’d gladly hear it!

4. What’s the “default value function” and the “normal utility function” in “Optimisation incentive”? Are they clearly defined in the literature?

Reply: It is now clarified.

5. “Worse still… for any action...” → if you choose b as some action with a bad corrigibility property, it seems reasonable that it can be better than most actions on v_N + v_S (for instance if b is the argmax). I don’t see how that’s a “worse still” scenario; it seems plausible and normal.

Reply: The bad thing about this scenario is that U_N could be mis-specified, yet shutdown would not be possible. It can be bad, normal, and plausible all at once. I’m not completely sure what the uncertainty here is.

6. “From this reasoning, we conclude” → are you inferring things from some hypothetical b that would satisfy all the things you mention? If that’s the case, I would need an example to see that it’s indeed possible. Even better would be a proof that you can always find such a b.

Reply: This is not what we try to show. It is possible that there exists no b with all those properties. The question is whether we can guarantee that no such b exists. The conclusion is that we cannot guarantee it, not that there will always exist such a b. This has now been clarified.

7. “it is clear that we could in theory find a θ” → could you expand on this?

Reply: It has been clarified.

8. “Given the robust optimisation incentive property, it is clear that the agent may score very poorly on U_N in certain environments.” → again, can you expand on why it’s clear?

Reply: It has been clarified.

9. In the appendix, in your 4-line inequality, do you assume that U_N(a_S) is non-negative (from line 2 to 3)? If yes, why?

Reply: Yes, U_N is bounded in [0, 1] as stated at the beginning of the appendix. The choice of bounds is arbitrary.