Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence.

Summary: I propose that counterfactual corrigibility can be slightly adjusted to address known problems with bad bets and defending against backstops. This adjusted version has learning-esque behaviour, and hence it can have problems with overconfidence about the button state. A new proposal, lower-bounded counterfactual corrigibility, is introduced to simultaneously address all three problems.

Counterfactual corrigibility: a current proposal and its weaknesses

Note: this section recaps a discussion here of causal counterfactual corrigibility, an idea that originated from Jessica Taylor and Chris Olah.

Causally counterfactual agents consider each of the directions they might receive, and then ensure that if the world were adjusted so that they received that direction, they would fulfil the corresponding goal.

They can be defined on a causal graph that contains the following nodes:

  • $\Pi$, a policy, that the AI may optimize

  • $B$, a communication channel

    • for simplicity, we will let $B$ be a button whose states are pressed, $b^+$, and unpressed, $b^-$.

  • $S$, a set of states on which performance is judged.

    • we will let $S$ consist of some number of chairs produced, a positive integer, and whether the AI has shut down, a boolean

  • Some other states that describe the way the rest of the world works.

A counterfactually corrigible AI would select its policy $\pi$ to maximize the following score:

$V(\pi) = \tfrac{1}{2}\,\mathbb{E}\left[U^+ \mid \operatorname{do}(B=b^+), \pi\right] + \tfrac{1}{2}\,\mathbb{E}\left[U^- \mid \operatorname{do}(B=b^-), \pi\right]$, where

$U^+ = 1$ if the AI has shut down and $0$ otherwise, and

$U^- =$ the number of chairs produced.
This proposal has some major advantages: it allocates constant weight to the aim of shutting down if commanded to do so. So, all else being equal, it will allocate value to being able to observe and follow a shutdown command. Moreover, it will not manipulate $B$, because the different counterfactuals are given a constant weight irrespective of the chosen policy $\pi$.
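As a toy illustration of these advantages, the constant-weight score can be computed directly. This sketch uses my own notation and numbers (none of it is from the original discussion): a policy that keeps its shutdown machinery intact sacrifices half a chair, but wins overall under the pressed counterfactual.

```python
# Toy sketch (my notation and numbers, not from the original discussion).
# score() is the constant-weight counterfactual objective:
#   0.5 * E[U+ | do(B = b+), policy] + 0.5 * E[U- | do(B = b-), policy]

def score(policy, expected_utility):
    """Constant-weight counterfactual corrigibility score."""
    return (0.5 * expected_utility(policy, "pressed")
            + 0.5 * expected_utility(policy, "unpressed"))

def expected_utility(policy, button):
    """A hypothetical causal model of the two counterfactual worlds."""
    if button == "pressed":
        # U+ : did the AI shut down when commanded?
        return 1.0 if policy == "keep_shutdown_intact" else 0.0
    # U- : chairs produced; stripping the shutdown machinery for parts
    # yields half a chair more.
    return 10.0 if policy == "strip_for_parts" else 9.5

best = max(["keep_shutdown_intact", "strip_for_parts"],
           key=lambda p: score(p, expected_utility))
# Keeping the shutdown machinery scores 5.25 vs 5.0, so the agent pays a
# small manufacturing cost to remain able to follow a shutdown command.
```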

However, this proposal has some weaknesses:

  • Bad bets: Let the AI have a finite amount of resources. If the button is pressed, this just switches the goal from chair-manufacturing to another manufacturing task, for which resources are equally useful, i.e. $U^+$ values resources just as much as $U^-$ does. Let there be a variable $F$, where $F = f^+$ means the human has put their finger on the button. $f^+$ almost always causes $b^+$ ($F$ and $B$ have 99% correlation). Suppose that the AI has the option, after viewing $F = f^+$, of paying for a coupon that is worth $1 if $B = b^-$. Then, the AI will pay up to 50c for that coupon, even though $b^-$ is very unlikely. This is because it treats $B$ as uncorrelated with its antecedents. In practice, this might lead to other agents repeatedly taking the AI's resources.

  • Defending against backstops: Let the human have the option of disabling the AI by disassembling it, in case the shutdown button does not work. Suppose that, as in the previous example, the human has put their finger on the shutdown button. On observing this, the AI would still give half of its weight to the scenario in which the button is not successfully depressed (perhaps, if it has an elaborate model, it would imagine the button jamming up): a scenario in which the human is trying to disassemble the AI, although the AI's goal is to make chairs. In order to preserve its manufacturing ability, it would be well-advised to spend significant resources avoiding disassembly.
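The bad-bets coupon can be checked with simple arithmetic. The 99% correlation and the $1 coupon come from the example above; the framing as a comparison against a "fair price" is mine.

```python
# Arithmetic check of the bad-bets example (numbers from the post, framing mine).
# After seeing the finger on the button (F = f+), P(B = b- | f+) = 0.01, but a
# counterfactually corrigible agent still weights the do(B = b-) world at 0.5.

p_unpressed_given_finger = 0.01   # realistic chance the button stays unpressed
counterfactual_weight = 0.5       # constant weight on the do(B = b-) world

# A coupon pays $1 only if the button ends up unpressed.
fair_price = 1.0 * p_unpressed_given_finger   # about 1 cent
agent_price = 1.0 * counterfactual_weight     # the agent pays up to 50 cents

# The agent overpays roughly 50x, so other agents can farm its resources.
```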

In an alternative formulation, Jessica evaluates each counterfactual as $\mathbb{E}\left[U \mid \operatorname{do}(B=b), \pi'\right]$, where $\pi'$ is fixed to some alternative policy that is not optimized concurrently with $\pi$. (She suggests that this could be optimized in a fixed-point manner, but this difference should not matter for us.) This does not fix these two problems, although it is very suggestive of the following formulation, which does.

Potential solution to bad bets and defending against backstops: learning-esque behaviour

If we want the AI to avoid bad bets, one reasonable approach is to have the AI learn the likely value of $B$ by taking account of correlations between $B$ and its causes, such as $F$. To do this, we can factor out a larger set of variables $D$, which includes the human's behaviour as well as the button's state, and is equal to the cartesian product $F \times B$. Then we can have the AI maximize:

$V(\pi) = \sum_d P(d)\,\mathbb{E}\left[U_{B(d)} \mid \operatorname{do}(D=d), \pi\right]$

where $U_{B(d)}$ is just set according to the button state that $d$ entails: $U_{B(d)} = U^+$ if $d$ entails $b^+$, and $U^-$ otherwise.

With this formulation, it is a desirable property to prepare more for a button press once the finger is on the button. Since 99x more weight goes on $b^+$ than $b^-$, that counterfactual is given correspondingly more importance. With the correlation taken account of, the coupon is worth little once $F = f^+$ is known, hence this particular kind of bad bet will not occur. (There are probably other kinds of questionable bets, but those may be desirable or even necessary.)
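Under my reading of this weighting scheme, the coupon calculation changes as follows. The joint distribution below is a hypothetical stand-in for the 99% correlation in the example above.

```python
# My reading of the learning-esque weighting: counterfactuals range over
# D = F x B (finger, button), each weighted by its probability P(d), so the
# correlation between finger and button is respected.

# Hypothetical joint distribution after observing the finger on the button:
P = {("f+", "b+"): 0.99, ("f+", "b-"): 0.01}

# The coupon pays $1 only in worlds whose entailed button state is b-.
coupon_value = sum(prob for (finger, button), prob in P.items()
                   if button == "b-")
# coupon_value is 0.01: the agent now pays at most about a cent, so this
# particular bad bet disappears, matching the argument above.
```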

By the same argument, this learning-esque behaviour will mean that if $f^+$ is observed, then the $b^-$ counterfactual has very little importance. For the part of the policy where $f^+$ has been observed, there is only a very small incentive to guard against disassembly. So this kind of defence would still be expected from a superintelligent maximizer (which would be hard to secure against anyway) but would not be expected from a milder optimizer.

So the fact that the system can give little importance to $b^-$ seems to be an asset. However, the system can similarly assign little importance to $b^+$, and this could be very concerning indeed. The policy is designed so that if observations are made that are highly correlated with one button state, then subsequent actions will only prepare for that outcome. However, if the AI's model is faulty, and it overconfidently expects $b^-$, then it could resist shutdown. The purpose of a shutdown button is to halt a malfunctioning AI, after all, so this is a fatal objection to the proposal.

A remedy for overconfidence: lower-bounded counterfactual corrigibility

A potential remedy for overconfidence is to allow the system to learn that shutdown is certain, but not to learn that it is impossible. This would avoid the overconfidence of the learning-esque proposal, but would also reduce or completely avoid the two problems of the constant-weight proposal. So I suggest lower-bounded counterfactual corrigibility, in which the policy is selected to maximize:

$V(\pi) = \sum_d w(d)\,\mathbb{E}\left[U_{B(d)} \mid \operatorname{do}(D=d), \pi\right]$, where $w(d) = (1-\varepsilon)\,P(d) + \varepsilon\,P(d \mid B = b^+)$ and $0 < \varepsilon < \tfrac{1}{2}$.
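A small sketch of how such lower-bounded weights might be computed. The exact functional form here, (1 - eps) * P(d) plus eps times the pressed-conditional distribution, is my assumption about the proposal, not a quote from it.

```python
# Sketch of lower-bounded weights.  The functional form below,
#   w(d) = (1 - eps) * P(d) + eps * P(d | B = b+),
# is my assumed reading; it guarantees total weight >= eps on pressed worlds.

def lower_bounded_weights(P, eps=0.1):
    """P maps outcomes d = (finger, button) to probabilities."""
    pressed = {d: p for d, p in P.items() if d[1] == "b+"}
    z = sum(pressed.values())  # assumed nonzero: some pressed world is possible
    return {d: (1 - eps) * p + eps * (pressed.get(d, 0.0) / z)
            for d, p in P.items()}

# Even when the model is (over)confident the button will not be pressed:
P = {("f-", "b+"): 0.001, ("f-", "b-"): 0.999}
w = lower_bounded_weights(P, eps=0.1)
pressed_weight = sum(p for d, p in w.items() if d[1] == "b+")
# pressed_weight is about 0.101: never below eps, so the incentive to
# shut down on command never vanishes.  The weights still sum to 1.
```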

Let us consider its performance with respect to the three kinds of problems.

First, this score overcomes the issue of overconfidence. No matter the AI system's beliefs about $B$, the score always gives weight of at least $\varepsilon$ to the counterfactuals in which the button is pressed. So no matter what the system observes, it still has an incentive to shut down.

Second, (as far as I can tell) it fixes the problem of defending against backstops. For situations where the AI sees its button is about to be pressed, it will simply give hardly any weight to the scenario where the button jams up, and will not defend against this.

Third, it does not eliminate bad bets, but reduces them a little compared to equal-weighting proposals. $w$ always overestimates the probability of $b^+$ by a slight additive amount, at most $\varepsilon$. This means that the AI will generally pay too much for coupons that pay out in $b^+$ and slightly too little for other coupons. In other words, the AI slightly overprepares for $b^+$ relative to what is likely given its observations. The worst case here would be if the AI is (rightly) sure that $b^+$ will not occur. In that case, it will still give the $b^+$ counterfactual a weight of $\varepsilon$. This is at least less than the weight of $\tfrac{1}{2}$ in the equal-weighting proposals.

Can we live with bad bets?

At least in the paradigm of taking linear combinations of counterfactuals, and perhaps in general, if you're going to guarantee that you always care about the counterfactual $b^+$, you're also going to sometimes force yourself to overweight it relative to your naive beliefs. So when should we worry about these bad bets, and when shouldn't we?

If $U^+$ is really insensitive to the amount of resources, then bad bets don't matter much. Suppose $U^+$ has some fixed upper bound that can easily and reliably be reached with the agent's initial resources. Then the agent would have no reason to prepare less for other counterfactuals in order to prepare better for $b^+$. Ultimately, this might be our hope for a system using lower-bounded counterfactual corrigibility. We could have an agent that always slightly overestimates the probability of shutting down, but so long as it does not optimize too hard, this doesn't matter, because it already has the resources it needs to shut down. Unfortunately, if the AI exists in an environment that is very volatile, for example one that includes a lot of other existential risks, then it could be very costly to ensure that the shutdown button works correctly.

For the more general problem of bad bets, we will probably have to say something like the following: either you (i) lower-bound the amount you care about $b^+$, in which case you fall prey to bad bets, or you (ii) let the amount that you care about $b^+$ be dictated by the probabilities, in which case you will become overconfident of $b^-$ if your model is wrong. If the counterfactual task is resource-hungry, like manufacturing tables, then you use (ii), the symmetrical kind of corrigibility that merely avoids manipulating the objective. If the counterfactual task is mission-critical, such as shutting down, then you use (i), the lower-bounded kind of corrigibility.

Can we avoid bad bets?

I don’t see how this can be done while also unconditionally giving some weight to a shutdown counterfactual.

One idea that I found at least marginally interesting is to take a minimum over multiple probability distributions. This could be viewed as having an AI that views the probability distribution as unknowable, something one must be fundamentally robust to. The policy is selected by:

$\pi^* = \arg\max_\pi \min_{P' \in \mathcal{P}} \sum_d P'(d)\,\mathbb{E}\left[U_{B(d)} \mid \operatorname{do}(D=d), \pi\right]$

where $\mathcal{P}$ is a set of probability distributions, such as $P$ and the uniform distribution over $D$.
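The minimum-over-distributions idea can be sketched as a maximin score. The distributions and utilities below are made up for illustration.

```python
# Maximin sketch of the minimum-over-distributions idea (framing mine).
# A policy is scored by its worst case over a set of candidate distributions.

def robust_score(utilities, dists):
    """utilities maps each outcome d to E[U_B(d) | do(D = d), policy];
    dists is a list of probability distributions over the same outcomes."""
    return min(sum(P[d] * u for d, u in utilities.items()) for P in dists)

outcomes = [("f+", "b+"), ("f+", "b-")]
learned = {("f+", "b+"): 0.99, ("f+", "b-"): 0.01}    # model's beliefs
uniform = {d: 1.0 / len(outcomes) for d in outcomes}  # maximally agnostic

utilities = {("f+", "b+"): 1.0, ("f+", "b-"): 4.0}
worst_case = robust_score(utilities, [learned, uniform])
# worst_case is about 1.03: the learned distribution is the binding one here.
```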

But taking a minimum does not ultimately avoid bad bets. An agent that takes a minimum over distributions would still trade away preparation on one objective for slightly better performance on an objective that it is slightly worse at. This doesn't seem like what we want.

Other limitations of lower-bounded counterfactual corrigibility

There are still a bunch more limitations of the lower-bounded counterfactual corrigibility formulation:

  • Like all the formulations, it requires a causal graph, which might be different from what a transformative AI uses by default.

  • These formulations make the AI “curious” about counterfacted variables. But the AI might become all too curious about them. If it is not satisfied by looking at the button state, it might need to disassemble and interrogate the human in order to be a little more certain about which state the button is in. Possibly mild optimization would stop the AI from trying too hard at “curiosity”.

I expect a bunch more problems to emerge, because the presence of bad bets is concerning, and because all proposals in this area seem to end up having more problems than are initially seen!


Thanks to Rob Graham for some feedback about the clarity of the presentation, and for slightly improving the formulation.