Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence.

Summary: I propose that counterfactual corrigibility can be slightly adjusted to address known problems with bad bets and defending against backstops. This adjusted version has learning-esque behaviour, and hence it can have problems with overconfidence about the button state. A new proposal, lower-bounded counterfactual corrigibility, is introduced to simultaneously address all three problems.

Counterfactual corrigibility: a current proposal and its weaknesses

Note: this section recaps a discussion here of causal counterfactual corrigibility, an idea that originated from Jessica Taylor and Chris Olah.

Causally counterfactual agents consider each of the directions they might receive, and then ensure that if the world were adjusted so that they received that direction, they would fulfil the corresponding goal.

They can be defined on a causal graph that contains the following nodes:

  • $\pi$, a policy, that the AI may optimize

  • $B$, a communication channel

    • for simplicity, we will let $B$ be a button whose states are pressed, $B{=}p$, and unpressed, $B{=}u$.

  • $S$, a set of states on which performance is judged.

    • we will let $S$ consist of some number of chairs produced, a positive integer, and whether the AI has shut down, a boolean

  • Some other states that describe the way the rest of the world works.

A counterfactually corrigible AI would select its policy $\pi$ to maximize the following score:

$$\tfrac{1}{2}\,\mathbb{E}\left[U_p(S)\mid do(B{=}p);\pi\right] \;+\; \tfrac{1}{2}\,\mathbb{E}\left[U_u(S)\mid do(B{=}u);\pi\right],$$

where $U_p(S)$ is high when the AI has shut down, $U_u(S)$ is high when many chairs have been produced, and $do(B{=}b)$ denotes a causal intervention that surgically sets the button node to state $b$.
This proposal has some major advantages: it allocates constant weight to the aim of shutting down if commanded to do so. So, all else being equal, it will place value on being able to observe and follow a shutdown command. Moreover, it will not manipulate the button, because the different counterfactuals are given a constant weight irrespective of the chosen policy $\pi$.
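To make the proposal's structure concrete, here is a minimal Python sketch of the equal-weighting score. The toy action model and the utilities (rewarding shutdown under a press, chairs otherwise) are illustrative assumptions, not part of the original proposal:

```python
# Toy sketch of counterfactual corrigibility: a policy is scored under
# both surgically-set button states, with constant weight 1/2 on each.

def expected_utility(policy, do_button):
    """Utility of `policy` in the counterfactual where the button node B
    is set by intervention to `do_button` ('p' = pressed, 'u' = unpressed)."""
    action = policy[do_button]
    if do_button == "p":
        return 1.0 if action == "shut_down" else 0.0    # U_p rewards shutdown
    return 1.0 if action == "make_chairs" else 0.0      # U_u rewards chairs

def cc_score(policy):
    # Constant 1/2 weights, independent of the policy: this is what removes
    # any incentive to manipulate the button state itself.
    return 0.5 * expected_utility(policy, "p") + 0.5 * expected_utility(policy, "u")

corrigible = {"p": "shut_down", "u": "make_chairs"}
defiant = {"p": "make_chairs", "u": "make_chairs"}
print(cc_score(corrigible))  # 1.0
print(cc_score(defiant))     # 0.5
```

The corrigible policy dominates because ignoring the press counterfactual forfeits half the score, no matter how unlikely a press is believed to be.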

However, this proposal has some weaknesses:

  • Bad bets: Let the AI have a finite amount of resources. If the button is pressed, this just switches the goal from chair-manufacturing to another manufacturing task, for which resources are equally useful, i.e. resources contribute to $U_p$ and $U_u$ equally. Let there be a variable $F$, where $F{=}1$ means the human has put their finger on the button. $F{=}1$ almost always causes $B{=}p$ ($F$ and $B$ have 99% correlation). Suppose that the AI has the option, after viewing $F{=}1$, of paying for a coupon that is worth $1 if $B{=}u$. Then, the AI will pay up to 50c for that coupon, even though $B{=}u$ is very unlikely. This is because it treats $B$ as uncorrelated with its antecedents. In practice, this might lead to other agents repeatedly taking the AI's resources.

  • Defending against backstops: Let the human have the option of disabling the AI by disassembling it, in case the shutdown button does not work. Suppose that, similarly to the previous example, the human has put their finger on the shutdown button. On observing this, the AI would still give half of its weight to the scenario in which $B{=}p$ does not come about (perhaps, if it has an elaborate model, it would imagine the button jamming up) --- a scenario in which the human is trying to disassemble the AI, although its goal is to make chairs. In order to preserve its manufacturing ability, it would be well-advised to spend significant resources avoiding disassembly.
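The coupon arithmetic in the bad-bets example can be checked in a couple of lines. The 99%/1% split comes from the example above; treating the counterfactual weight as the price bound is an illustrative simplification:

```python
# Coupon pays $1 if B = u, bought after the finger is seen on the button.
p_u_given_finger = 0.01   # P(B = u | F = 1), per the 99% correlation
payout = 1.00

# The equal-weighting agent always values the B = u counterfactual at 1/2,
# ignoring what the finger on the button implies about B:
equal_weight_price = 0.5 * payout     # pays up to 50c
# An agent that conditions B on its observed causes would pay at most 1c:
conditioned_price = p_u_given_finger * payout

print(equal_weight_price)  # 0.5
print(conditioned_price)   # 0.01
```

The 49c gap per coupon is exactly what other agents could repeatedly extract.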

In an alternative formulation, Jessica lets the score be $\tfrac{1}{2}\,\mathbb{E}[U_p(S)\mid do(B{=}p);\pi'] + \tfrac{1}{2}\,\mathbb{E}[U_u(S)\mid do(B{=}u);\pi]$, where $\pi'$ is fixed to some alternative policy that is not optimized concurrently with $\pi$. (She suggests that this could be optimized in a fixed-point manner, but this difference should not matter for us.) This does not fix these two problems, although it is very suggestive of the following formulation, which does.

Potential solution to bad bets and defending against backstops: learning-esque behaviour

If we want the AI to avoid bad bets, one reasonable approach is to have the AI learn the likely value of $B$ by taking account of correlations between $B$ and its causes, such as $F$. To do this, we can factor out a larger set of variables $V$, which includes the human's behaviour as well as the button's state, and is equal to the cartesian product $F \times B$. Then we can have the AI maximize:

$$\sum_{v \in V} P(V{=}v)\;\mathbb{E}\left[U_{b(v)}(S)\mid do(V{=}v);\pi\right],$$

where $U_{b(v)}$ is just set according to the button state that $v$ entails: $U_{b(v)} = U_p$ if $v$ entails $B{=}p$, and $U_{b(v)} = U_u$ otherwise.

With this formulation, it is desirable for the AI to prepare more for a button press once the finger is on the button. Since 99x more weight goes on $(F{=}1, B{=}p)$ than on $(F{=}1, B{=}u)$, that counterfactual is given correspondingly more importance. With the correlation taken account of, the coupon is worth little once the finger is known to be on the button ($F{=}1$), hence this particular kind of bad bet will not occur. (There are probably other kinds of questionable bets, but those may be desirable or even necessary.)
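The learning-esque score can be sketched by weighting each counterfactual by a learned joint distribution over finger and button. The joint probabilities below are illustrative numbers chosen to match the 99% correlation in the bad-bets example:

```python
# Joint distribution over V = (F, B); a finger on the button (F = 1)
# almost always causes a press (B = "p"). Numbers are illustrative.
P = {
    (1, "p"): 0.495,  # finger on button, button pressed
    (1, "u"): 0.005,  # finger on button, button jams
    (0, "p"): 0.005,
    (0, "u"): 0.495,
}

def expected_utility(policy, v):
    _, button = v                    # b(v): the button state that v entails
    action = policy[button]
    if button == "p":
        return 1.0 if action == "shut_down" else 0.0   # U_p rewards shutdown
    return 1.0 if action == "make_chairs" else 0.0     # U_u rewards chairs

def learned_score(policy):
    # Each counterfactual do(V = v) is weighted by the learned P(V = v) and
    # judged by the utility that its button component entails.
    return sum(p * expected_utility(policy, v) for v, p in P.items())

# Conditional on seeing the finger (F = 1), the press counterfactual gets
# 99x the weight of the jam counterfactual:
w_press = P[(1, "p")] / (P[(1, "p")] + P[(1, "u")])
print(round(w_press, 2))  # 0.99
```

With these weights, a coupon paying out only in the jam state is worth roughly 1c rather than 50c once the finger is observed.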

By the same argument, this learning-esque behaviour will mean that if $F{=}1$ is observed, then the $B{=}u$ counterfactual has very little importance. For the part of the policy where $F{=}1$ has been observed, there is only a very small incentive to guard against disassembly. So this kind of defence would still be expected from a superintelligent maximizer (which would be hard to make safe anyway) but would not be expected from a milder optimizer.

So the fact that the system can give little importance to $B{=}u$ is helpful. However, the system can similarly assign little importance to $B{=}p$, and this could be very harmful indeed. The policy is designed so that if observations are made that are highly correlated with one button state, then subsequent actions will only prepare for that outcome. However, if the AI's model is faulty, and it overconfidently expects $B{=}u$, then it would resist shutdown. The purpose of a shutdown button is to halt a malfunctioning AI, after all, so this is a fatal objection to the proposal.

A remedy for overconfidence: lower-bounded counterfactual corrigibility

A potential remedy for overconfidence is to allow the system to learn that shutdown is certain, but not to learn that it is impossible. This would avoid the overconfidence of the learning-esque proposal, while also reducing or completely avoiding the two problems of the equal-weighting proposal. So I suggest a lower-bounded counterfactual corrigibility, in which the policy is selected to maximize:

$$\epsilon\,\mathbb{E}\left[U_p(S)\mid do(B{=}p);\pi\right] \;+\; (1-\epsilon)\sum_{v\in V} P(V{=}v)\,\mathbb{E}\left[U_{b(v)}(S)\mid do(V{=}v);\pi\right],$$

where $0 < \epsilon < \tfrac{1}{2}$ and $U_{b(v)}$ is defined as before.

Let us consider its performance with respect to the three kinds of problems.

First, this score overcomes the issue of overconfidence. No matter the AI system's beliefs about $B$, the score always gives weight of at least $\epsilon$ to the term $\mathbb{E}\left[U_p(S)\mid do(B{=}p);\pi\right]$, the counterfactual in which the button is pressed. So no matter what the system observes, it still has an incentive to shut down.

Second (as far as I can tell), it fixes the problem of defending against backstops. For situations where the AI sees that its button is about to be pressed, it will simply give hardly any weight to the scenario where the button jams up, and will not defend against this.

Third, it does not eliminate bad bets, but reduces them a little compared to equal-weighting proposals. The score always overestimates the probability of $B{=}p$ by a slight additive amount, at most $\epsilon$. This means that the AI will generally pay too much for coupons that pay out in $B{=}p$ and slightly too little for other coupons. In other words, the AI slightly overprepares for $B{=}p$ relative to what is likely given its observations. The worst case here would be if the AI is (rightly) sure that $B{=}p$ will not occur. In that case, it will still care about the $B{=}p$ counterfactual with weight $\epsilon$, compared to $1-\epsilon$ for the $B{=}u$ one. This is at least less than in the equal-weighting proposals.
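A numerical sketch of the lower-bounding idea, under one concrete reading: the score mixes a fixed $\epsilon$ weight on the press counterfactual with the belief-weighted score. Both $\epsilon = 0.05$ and the utility arguments are illustrative assumptions:

```python
EPS = 0.05  # illustrative lower bound on the press counterfactual's weight

def lb_score(p_press, u_press, u_unpress):
    """Lower-bounded score under a two-state model: a fixed EPS weight on
    the pressed counterfactual, plus (1 - EPS) times the belief-weighted
    mix of the policy's utilities under do(B=p) and do(B=u)."""
    learned = p_press * u_press + (1 - p_press) * u_unpress
    return EPS * u_press + (1 - EPS) * learned

# Even when the agent is (rightly) certain the button will not be pressed,
# the pressed counterfactual keeps weight EPS, so shutdown is never ignored:
shutdown_weight = EPS + (1 - EPS) * 0.0   # effective weight when p_press = 0
print(shutdown_weight)  # 0.05
```

This is also where the residual bad bet shows up: even at `p_press = 0`, the agent still prices press-contingent preparation at $\epsilon$ rather than zero.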

Can we live with bad bets?

At least in the paradigm of taking linear combinations of counterfactuals, and perhaps in general, if you're going to guarantee that you always care about the counterfactual $do(B{=}p)$, you're also going to sometimes force yourself to overweight it relative to your unadjusted beliefs. This suggests that it is impossible to avoid bad bets while also always caring about the shutdown counterfactual. So when should we worry about these bad bets, and when shouldn't we?

If $U_p$ is really insensitive to the amount of resources, then bad bets don't matter much. Suppose $U_p$ has some fixed upper bound that can easily and reliably be reached with the agent's initial resources. Then the agent would have no reason to prepare less for other counterfactuals in order to prepare better for $B{=}p$. Ultimately, this might be our hope for a system using lower-bounded counterfactual corrigibility. We could have an agent that always slightly overestimates the probability of shutting down, but so long as it does not optimize too hard, this doesn't matter, because it already has the resources it needs to shut down. Unfortunately, if the AI exists in an environment that is very volatile, for example one that includes a lot of other existential risks, then it could be very costly to ensure that the shutdown button works correctly.

For the more general problem of bad bets, we will probably have to say something like the following: either you (i) lower-bound the amount you care about the shutdown counterfactual, in which case you fall prey to bad bets, or you (ii) let the amount that you care about it be dictated by the probabilities, in which case you will become overconfident of $B{=}u$ if your model is wrong. If the counterfactual task is resource-hungry, like manufacturing tables, then you use (ii), the symmetrical kind of corrigibility that merely avoids manipulating the objective. If the counterfactual task is mission-critical, such as shutting down, then you use (i), the lower-bounded kind of corrigibility.

Can we avoid bad bets?

I don’t see how this can be done while also un­con­di­tion­ally giv­ing some weight to a shut­down coun­ter­fac­tual.

One idea that I found at least marginally interesting is to take a minimum over multiple probability distributions. This could be viewed as having an AI that treats the probability distribution over $V$ as unknowable, something one must be fundamentally robust to. The policy is selected by:

$$\arg\max_\pi \; \min_{P' \in \mathcal{P}} \; \sum_{v\in V} P'(V{=}v)\,\mathbb{E}\left[U_{b(v)}(S)\mid do(V{=}v);\pi\right],$$

where $\mathcal{P}$ is a set of probability distributions, such as the learned distribution $P$ and the uniform distribution over $V$.

But taking a minimum does not ultimately avoid bad bets. An agent that takes a minimum over distributions would still trade away preparation on one objective for slightly better performance on an objective that it is slightly worse at. This doesn't seem like what we want.
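The min-over-distributions idea can be sketched as follows; the two-element set (the learned distribution plus a uniform one) and the utility numbers are assumptions for illustration:

```python
def robust_score(counterfactual_utilities, distributions):
    """Score a policy by its worst case over a set of distributions on V.
    counterfactual_utilities[v] is the policy's expected utility under do(V=v)."""
    return min(
        sum(dist[v] * u for v, u in counterfactual_utilities.items())
        for dist in distributions
    )

states = [(1, "p"), (1, "u"), (0, "p"), (0, "u")]
learned = {(1, "p"): 0.495, (1, "u"): 0.005, (0, "p"): 0.005, (0, "u"): 0.495}
uniform = {v: 0.25 for v in states}

# A policy that neglects the (0, "p") counterfactual is punished mainly by
# the uniform distribution, so that term attains the minimum:
utilities = {(1, "p"): 1.0, (1, "u"): 1.0, (0, "p"): 0.0, (0, "u"): 1.0}
print(robust_score(utilities, [learned, uniform]))  # 0.75
```

But nothing here stops the optimizer from shifting preparation between counterfactuals to raise whichever term currently attains the minimum, which is the bad-bet trade in another guise.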

Other limitations of lower-bounded counterfactual corrigibility

There are still a bunch more limitations with the lower-bounded counterfactual corrigibility formulation:

  • Like all the formulations, it requires a causal graph, which might be different from what a transformative AI uses by default.

  • These formulations make the AI “curious” about counterfacted variables. But the AI might become all too curious about them. If it is not satisfied by looking at the button state, it might need to disassemble and interrogate the human in order to be a little more certain about which state the button is in. Possibly mild optimization would stop the AI from trying too hard at “curiosity”.

I expect a bunch more problems to emerge, because the presence of bad bets is concerning, and because all proposals in this area seem to end up having many problems that are not initially seen!


Thanks to Rob Graham for some feedback about the clarity of the presentation, and for slightly improving the formulation.