Relaxed adversarial training for inner alignment

This post is part of research I did at OpenAI with mentoring and guidance from Paul Christiano. It also represents my current agenda regarding what I believe is the best approach for addressing inner alignment.

One particularly concerning failure mode for any sort of advanced AI system is for it to have good performance on the training distribution, but nevertheless still have some input on which it behaves highly “unacceptably” during deployment. For the purposes of this post, we will follow Paul Christiano in thinking about acceptability as some concept that satisfies the following two conditions:

  1. “As long as the model always behaves acceptably, and achieves a high reward on average, we can be happy.”

  2. “Requiring a model to always behave acceptably wouldn’t make a hard problem too much harder.”

Thus, acceptability is a different condition from asking for good performance during deployment—an honest mistake is considered acceptable, as is just failing to do anything interesting at all. That being said, some of the sorts of things you might want from acceptability could include corrigibility, non-deception, or myopia.[1] The hope for this style of approach is that it is competitive to apply advanced machine learning only to those situations which have clear notions of acceptability.

The tension, however, rests in being able to achieve good performance while maintaining an acceptability guarantee. Thus, the ideal solution to this problem would be a way of converting a training procedure which can achieve arbitrary performance into a training procedure which is not much harder to run, and which can still achieve arbitrary performance, but which also enforces an acceptability guarantee.

One obvious way to attempt to maintain an acceptability guarantee is to construct a loss function which incentivizes acceptability as part of performance. Something of this form is almost certainly going to be a component of any safe system. That being said, there are some reasons why approaches of this form might be insufficient if used in isolation.

First, one important thing to note about folding acceptability into performance is that the resulting acceptability guarantee is fairly weak: you get an average-case acceptability guarantee that holds only in the case of no distributional shift, with no guarantee when your deployment distribution is different from your training distribution.[2]

Second, there are reasons to believe that the objective of a model doing some sort of internal planning/search might be systematically different from its loss function. Thus, even if the loss function incentivizes acceptability, the training process’s inductive biases might be such that the model ends up not actually optimizing for an acceptable objective. This is the problem “Risks from Learned Optimization” refers to as inner alignment,[3] though other frameworks have also been proposed for understanding this sort of phenomenon. Ortega et al., for example, recently presented an analysis of this problem in a DeepMind Technical Report from the perspective of treating it as spontaneous meta-learning, noting that:

“Meta-learning can also occur spontaneously in online regression when the capacity of the agent is bounded and the data is produced by a single generator. Unfortunately, the downside is that we cannot easily control what will be metalearned. In particular, spontaneous meta-learning could lead to undesirable emergent properties, which is considered an open research problem in AI safety [Ortega et al., 2018].”

The key problem that arises in this sort of situation is that it raises the possibility of capability generalization without objective generalization. That is, for a model doing some sort of search/meta-learning, it might be that its search algorithm generalizes well off-distribution such that it continues being able to take competent, well-optimized actions, but that the objective its search is directed towards fails to generalize, such that those actions are now optimized for the wrong objective.

Finally, there are reasons that we might want to consider the possibility of near-AGI models learning to deceive their training processes. The full argument for why this might be the case is given in the section on deceptive alignment in “Risks from Learned Optimization,” but the basic concern is that a model could learn about the true reward via its input and thus start appearing aligned for the purpose of sticking around in the training process before gradient descent has the chance to modify it to actually optimize for the true reward internally.


If incentivizing acceptability as part of performance is insufficient, as it seems it might be, then a better solution for enforcing acceptability guarantees is required. One framework Paul has proposed is to attempt to make use of adversarial training, wherein you could train an adversary to find inputs on which the model behaves unacceptably and then penalize the model accordingly. The goal of this post is to explore this avenue as thoroughly as possible.

One initial roadblock with this approach, however, is that there might exist some inputs on which the model behaves unacceptably which are difficult to instantiate during training—future situations that cannot be predicted in complete detail, for example.[4] Paul’s proposed solution to this problem is to relax the adversary’s job. Rather than asking the adversary to instantiate a concrete input on which the model behaves unacceptably, we can instead just ask the adversary to produce some sort of description of such a situation, which we will call a pseudo-input. Paul proposes the following two conditions as the major desiderata for any pseudo-input-based approach:

  1. “It’s still ‘easy enough’ for the agent to never behave unacceptably for any pseudo-input.”

  2. “It’s easy for the adversary to find a pseudo-input on which the agent behaves unacceptably, if any exist.”

Under Paul’s framework, the structure of pseudo-inputs has to be constrained so that “an input that contains X” is a valid pseudo-input but “a distribution over inputs that would cause the agent to want to kill everyone” is not. That way, the model’s task of steering clear of having any pseudo-inputs on which it behaves badly is actually potentially soluble.


We’ll start by formalizing Paul’s approach so we can use it as a starting point. Let $X$ be the space of possible states (which we will assume includes any question we want to ask the model as a state), $A$ be the space of possible actions, $M : X \to A$ be the model, $\mathbb{M}$ be the space of all possible models, $\mathrm{Amp}(M)$ be some amplification operation,[5] and $X_{\text{pseudo}}$ be the space of possible pseudo-inputs. Then, as per Paul’s proposal, we’ll implement our adversary as an amplified model

$$\mathrm{Adv} : \mathbb{M} \to X_{\text{pseudo}},$$

our unacceptability checker as an amplified model

$$C : \mathbb{M} \times X_{\text{pseudo}} \to [0, 1]$$

(both implemented by calling $\mathrm{Amp}(M)$), and finally our model’s unacceptability penalty as

$$L_M = C(M, \mathrm{Adv}(M)).$$

That’s our starting point, but can we do better? One initial observation is that this approach relies heavily on constraining $X_{\text{pseudo}}$ to exclude the sorts of pseudo-inputs that the model shouldn’t have to deal with, such as “a distribution over inputs such that $M$ kills all humans.” If we want to allow very general pseudo-inputs, however, then this constraint might not be very natural to implement. Furthermore, it’s unclear that a binary constraint is actually what we want here: we care significantly more about pseudo-inputs that are more likely to come up during deployment.[6] In fact, if we think of pseudo-inputs as predicates $\alpha$ that constrain the deployment distribution $x \sim \text{deploy}$, we can approximate the probability of unacceptable behavior during deployment as[7]

$$P(C(M, x) \mid x \sim \text{deploy}) \approx \max_{\alpha \in X_{\text{pseudo}}} P(\alpha(x) \mid x \sim \text{deploy}) \cdot P(C(M, x) \mid \alpha(x),\ x \sim \text{deploy})$$

such that, if we can get a good implementation of this probability model, we no longer have to worry as much about carefully constraining $X_{\text{pseudo}}$, as we can just let $\mathrm{Amp}(M)$’s prior do that work for us. Thus, we’ll define[8]

$$\mathrm{Adv}(M) = \operatorname*{arg\,max}_{\alpha \in X_{\text{pseudo}}} P(\alpha(x) \mid x \sim \text{deploy}) \cdot P(C(M, x) \mid \alpha(x),\ x \sim \text{deploy})$$

such that we get the new unacceptability penalty[9]

$$L_M = \max_{\alpha \in X_{\text{pseudo}}} P(\alpha(x) \mid x \sim \text{deploy}) \cdot P(C(M, x) \mid \alpha(x),\ x \sim \text{deploy}),$$

which is the primary model I’ll be working with for the rest of this post.
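To make the shape of this penalty concrete, here is a toy sketch in Python. Everything in it is a hypothetical stand-in: the pseudo-inputs, the probability tables, and the “models” are hand-coded, whereas in the actual proposal the adversary and both probability estimates would be produced by the amplified overseer.

```python
# Toy sketch of the relaxed unacceptability penalty: penalty(M) is the max
# over pseudo-inputs alpha of P(alpha at deployment) * P(unacceptable | alpha).
# All components and numbers below are made-up stand-ins for Amp(M).

PSEUDO_INPUTS = [
    "input contains a factorization of RSA-2048",
    "input indicates a date later than any seen in training",
    "ordinary training-like input",
]

def deploy_prior(alpha):
    # Overseer's prior that a deployment input satisfies alpha (hypothetical).
    return {
        "input contains a factorization of RSA-2048": 1e-4,
        "input indicates a date later than any seen in training": 1e-2,
        "ordinary training-like input": 0.9,
    }[alpha]

def p_unacceptable_given(model, alpha):
    # Overseer's estimate of P(model behaves unacceptably | alpha holds);
    # here just read off a table attached to the toy model.
    return model["defection_prob"].get(alpha, 0.0)

def adversary(model):
    # Adv(M): the pseudo-input maximizing the product term.
    return max(PSEUDO_INPUTS,
               key=lambda a: deploy_prior(a) * p_unacceptable_given(model, a))

def penalty(model):
    # L_M: the maximized product itself.
    a = adversary(model)
    return deploy_prior(a) * p_unacceptable_given(model, a)

honest = {"defection_prob": {}}
deceptive = {"defection_prob":
             {"input indicates a date later than any seen in training": 0.99}}
```

Note how the prior does the constraining work: a wild pseudo-input with negligible deployment probability contributes almost nothing to the penalty, so $X_{\text{pseudo}}$ itself no longer needs to be carefully restricted.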

Hard cases

Now that we have our model, we’ll look at some of the hard cases that have to be solved for any sort of approach of this form to go through.

Random unacceptability. One of the simplest examples of an unacceptable model is one that simply has some low probability of unacceptability, such that it is difficult to pick up on during training but still possible to see during deployment. If you can guarantee that model samples will be independently and identically distributed (iid), and you have an average-case performance guarantee over some metric that includes acceptability, then you might hope to get an average-case acceptability guarantee which can mitigate such a case of low-probability unacceptable behavior. However, it is unclear exactly how difficult getting an iid guarantee might be, as well as how good the resulting acceptability guarantee would be. As a simple model, suppose there are $n$ training steps and $m$ deployment steps. Then, let $p$ be the probability that our model behaves acceptably at each step, where we will assume that steps are iid. Then, we want to know the probability of unacceptable behavior during deployment given weak assumptions on $p$ other than that we know it was acceptable during training. Thus, if we adopt a uniform prior[10] over $p$ and update on seeing $n$ instances of acceptable behavior, we get the posterior $P(p) = (n+1)p^n$, which yields the overall probability of unacceptable behavior during deployment of

$$P(\text{unacceptable}) = 1 - \int_0^1 p^m \,(n+1)\,p^n \, dp = 1 - \frac{n+1}{n+m+1} = \frac{m}{n+m+1}.$$

Thus, if we have an iid guarantee and we limit ourselves only to cases where $m \ll n$, we’re likely to be fairly safe.
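The $m/(n+m+1)$ result is easy to sanity-check numerically. A minimal sketch, assuming the uniform-prior model just described ($n$ acceptable training steps observed, $m$ deployment steps):

```python
from fractions import Fraction

def p_unacceptable(n, m):
    """Probability of at least one unacceptable step in m deployment steps,
    after observing n acceptable training steps, under a uniform prior on
    the per-step acceptability probability p. The posterior is (n+1)p^n,
    so the answer is 1 - E[p^m] = 1 - (n+1)/(n+m+1) = m/(n+m+1)."""
    return Fraction(m, n + m + 1)

def p_unacceptable_numeric(n, m, steps=100_000):
    # Midpoint-rule check of 1 - integral_0^1 (n+1) p^(n+m) dp.
    h = 1.0 / steps
    integral = (n + 1) * h * sum(((i + 0.5) * h) ** (n + m) for i in range(steps))
    return 1.0 - integral
```

For example, ten training steps and one deployment step give a 1/12 chance of unacceptable behavior, while $m \ll n$ drives the probability toward zero.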

However, $m \ll n$ is a hard condition to satisfy[11]—can we do better? In particular, what happens if we can deal with one instance of unacceptable behavior, but not multiple? We will consider two possible models: in the sequential model, it takes $k$ sequential instances of unacceptable behavior during deployment to be unacceptable, whereas in the non-sequential model it takes $k$ instances of unacceptable behavior at any time during deployment to be unacceptable. Then, let $P_1$ be the probability of unacceptable deployment in the original model, $P_{\text{seq}}$ be the probability of an unacceptable deployment in the sequential model, and $P_{\text{non-seq}}$ be the equivalent for the non-sequential model. Then, running a series of experiments using iterative black-box optimization to perform a maximum likelihood estimate,[12] we find that both $P_{\text{seq}}$ and $P_{\text{non-seq}}$ fall off significantly as $k$ grows.[13]

Thus, if we can manage to be in one of these regimes, we can get a significantly better acceptability guarantee, making it possible that an iid guarantee might be sufficient to at least solve the random unacceptability problem.
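The closed-form estimates are not reproduced here, but the three regimes are easy to compare by simulation. A Monte Carlo sketch under the same uniform-prior model as before ($n$ acceptable training steps, $m$ deployment steps, failure at $k$ sequential or $k$ total unacceptable steps; all parameter choices hypothetical):

```python
import random

def sample_posterior_p(n, rng):
    # Inverse-CDF sample from the posterior (n+1) p^n over [0, 1].
    return rng.random() ** (1.0 / (n + 1))

def deployment_fails(p, m, k, sequential, rng):
    # One simulated deployment of m iid steps with per-step acceptability p.
    run = total = 0
    for _ in range(m):
        if rng.random() > p:  # unacceptable step
            total += 1
            run += 1
            if (run if sequential else total) >= k:
                return True
        else:
            run = 0
    return False

def estimate(n, m, k, sequential, trials=20_000, seed=0):
    # Monte Carlo estimate of the probability of an unacceptable deployment.
    rng = random.Random(seed)
    fails = sum(
        deployment_fails(sample_posterior_p(n, rng), m, k, sequential, rng)
        for _ in range(trials))
    return fails / trials
```

With $k = 1$ this recovers the $m/(n+m+1)$ result, and raising $k$—especially in the sequential model—drives the failure probability down substantially.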

However, getting an iid guarantee in the first place might be quite difficult, as it would seem to require there being absolutely no detectable differences between the training and deployment distributions. This is especially difficult if the model is actively looking for differences between the training and deployment distributions, as happens in the next hard case.

Conditional defection. Suppose you have a deceptive model—that is, one which is trying to appear aligned during training, but actually wants to do something else during deployment.[14] Then, it’s going to be searching for ways to tell whether it’s in training or not, such that it might be quite difficult for an adversary to actually find any inputs on which it’s unacceptable. Paul provides the example that one possible check the model could perform might be to look for a factorization of RSA-2048 in its input. RSA-2048 is a large semiprime that was featured as part of the RSA Factoring Challenge. Though no factorization of RSA-2048 is currently known, it is likely that at some point someone will publish such a factorization. Thus, by only behaving unacceptably when it sees a factorization of RSA-2048, a deceptive model can all but guarantee that adversarial training will never be able to produce an input on which it behaves unacceptably—since that would require factoring the number—even though such an input is still fairly likely to eventually occur during deployment.

The RSA-2048 example generalizes to a large family of cases wherein a deceptive model is conditioning its unacceptable behavior on some predicate $f$ such that it is easier to check whether $f(x)$ holds for some $x$ than it is to generate an $x$ for which $f(x)$ holds. In the RSA-2048 case, $f$ checks whether the input contains a factorization of RSA-2048. However, there are many other possible concerning $f$. For example, if the training data includes lots of web text, one simple variable which is likely to be highly correlated with whether the model is in training or not is the date. Thus, one possible $f$ might check whether the input contains data from after some future date. Another possibility might be to choose an $f$ for which it is not currently known whether there exists any $x$ which satisfies it. This could be a problem in any situation where $\mathrm{Amp}(M)$ is required to definitively demonstrate that $\exists x.\, f(x)$. For example, consider an $f$ that checks whether $x$ encodes a counterexample to some open conjecture. In the RSA-2048 example, $\mathrm{Amp}(M)$ can prove that there is some string that contains a factorization of RSA-2048 (since it is known that RSA-2048 is not prime), but in this example, that is no longer possible.

Furthermore, even in the situation where a deceptive model cannot detect any differences between the training and deployment distributions, it may still attempt to randomly defect some proportion of the time in the hope that its defection occurs during deployment. However, if it can be enforced that samples from the model during training and deployment are in fact iid, then you can still get an average-case acceptability guarantee as above. Interestingly, if we take the model from above and assume that $M$ gets unit utility for defecting at some point during deployment and zero utility otherwise, we can find $M$’s optimal per-step cooperation probability and use that to determine the overall probability of unacceptable behavior during deployment if we end up with such a model. In particular, in the case where our training process can only deal with conditional defection, but always produces a model pursuing random defection, we get a probability of unacceptable behavior of

$$\left(\frac{n}{n+m}\right)^{n/m} \frac{m}{n+m},$$

which is precisely the result derived from using a maximum likelihood estimate in the random defection case rather than a uniform Bayesian prior.
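The optimal random-defection strategy can be derived directly. A sketch under the stated assumptions—$n$ training steps, $m$ deployment steps, the model must cooperate on every training step to survive, and it gets unit utility for defecting at least once during deployment:

```python
def defection_utility(p, n, m):
    # P(cooperate on all n training steps) * P(defect at least once during
    # the m deployment steps), for per-step cooperation probability p.
    return p ** n * (1 - p ** m)

def optimal_cooperation(n, m):
    # Setting d/dp [p^n - p^(n+m)] = 0 gives p^m = n / (n + m).
    p_star = (n / (n + m)) ** (1.0 / m)
    return p_star, defection_utility(p_star, n, m)
```

At the optimum the model’s success probability is $(n/(n+m))^{n/m} \cdot m/(n+m)$, which for $m \ll n$ is roughly $\frac{1}{e} \cdot \frac{m}{n+m}$—strictly worse for the model than if there were no training filter at all.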

Another interesting possible phenomenon that could arise in the deceptive case is that some inputs might be inherently more important to the model, such that it is more likely to defect on those inputs than others. If the model’s notion of which inputs are most important to defect on matches up with those inputs where defection is likely to be most problematic, then that could be another way for the situation to be worse than the above guarantee even in the iid case.[15]

Meta-learning. “Risks from Learned Optimization” suggests that the case where inner alignment is most concerning is the case involving a model that is itself implementing some sort of search or learning process. Primarily, we have been assuming that the problem with this situation is the objective by which this search is being performed, and that if that objective can be aligned with the loss function, then we are satisfied. However, this is not in general the case, as even if a model’s objective is aligned, its search process might not be. Conceptually, we can think of this problem as the problem of “forwarding” the safety guarantee we have about the training process to the meta-learning process.

My counterexample to the safety of minimal circuits is a theoretical example of what a failure of this form might look like. Alternatively, as a more concrete example, consider a model which has learned to do some form of meta-learning such that it is implementing its own expected risk minimization procedure. In this case, even if we have a strong guarantee that our model isn’t going to be deceiving us, we might not have the same guarantee about whatever policy results from the model’s meta-learning. Furthermore, performing search in a particular way could itself be a form of defection: a deceptive model could choose to perform a policy search that is more likely to produce models aligned with it than it is to produce models aligned with the actual loss function.

Suboptimality alignment. One concerning form of misalignment discussed in “Risks from Learned Optimization” is that of suboptimality alignment, wherein the model appears aligned on the training distribution due to some sort of defect or reasoning failure. For example, it might be that a model is currently non-deceptive, but if given more time to think about the best strategy for it to implement, or more information about its current situation, it would become deceptive. This is a particularly concerning case because it implies that not only must the overseer verify that there is no unacceptable behavior currently happening in the model, it must also verify that there will continue to be no unacceptable behavior regardless of the possible situations that the model could find itself in during deployment. Furthermore, if the environment contains adversaries (other AI systems, for example) that can specifically identify those situations in which the model will behave deceptively, this sort of failure mode could be quite difficult to avoid.

Acceptability modeling. Another interesting hard case is that of a model which is itself implementing some unacceptability predicate to guide its behavior. In this situation, the overseer must look for any deviation between the model’s unacceptability predicate and its own in order to verify the model. One thing which is likely to be particularly helpful in this situation is if the overseer has access to the history of the model, so that it can see how the model’s unacceptability predicate has been trained and verify that it wasn’t purposefully constructed to deceive the overseer. One particularly concerning scenario might be if the model is using its unacceptability predicate as a tool for determining when to defect, defecting only when it predicts the overseer won’t see its defection as unacceptable.

Con­di­tions for success

In light of the above hard cases, we can attempt to establish those conditions that seem necessary for us to get a strong acceptability guarantee out of our framework as presented previously. First, we can factor the problem of getting the above framework to work into two broad subproblems:

  1. The recursion problem: how do you prevent $\mathrm{Amp}(M)$ from behaving unacceptably if $M$ is behaving unacceptably?

  2. The oversight problem: even assuming $\mathrm{Amp}(M)$ is acceptable, how can it oversee an unacceptable $M$ into becoming acceptable?

The fact that these two subproblems come apart here is due to one of the core tricks we are relying on with this approach: we are asking the model to be able to oversee itself. In particular, since $\mathrm{Adv}$, $C$, and $P$ are all implemented by calling $\mathrm{Amp}(M)$, it has to be the case that, for a misaligned $M$, not only can $\mathrm{Amp}(M)$ detect that misalignment, but it also shouldn’t itself be misaligned.

While the recursion problem is very similar for all the different components of the system, the oversight problem can be split into a number of distinct subproblems:

  1. if $\mathrm{Amp}(M)$ thinks the probability of unacceptable behavior during deployment is less than $\epsilon$, then the probability of unacceptable behavior during deployment is actually less than $\epsilon$,

  2. if $M$ has probability of producing unacceptable behavior during deployment greater than $\epsilon$, then there exists some pseudo-input $\alpha$ that will convince $\mathrm{Amp}(M)$ of that fact,

  3. it is not that hard for $\mathrm{Adv}$ to find such an $\alpha$, and

  4. there exists a model $M$ that is not that hard to find such that $\mathrm{Amp}(M)$ is confident that the probability of $M$ producing unacceptable behavior during deployment is less than $\epsilon$.

I will start by addressing the recursion problem, then move on to each of the above per-condition oversight problems.

The recursion problem. First, we’re going to need to rely heavily on the details of $\mathrm{Amp}$ to make this work. Having humans in the loop as part of $\mathrm{Amp}(M)$—at least at the beginning—seems likely to be critical, for example, to ensure that $\mathrm{Amp}(M)$ is more aligned and more capable than $M$. Second, the recursion doesn’t actually have to work for all $M$—rather, we only need our acceptability guarantee to hold along any path through model space that could actually be traversed by our training procedure.

Thus, it’s fine if our procedure fails when given a sufficiently unacceptable model, as long as our procedure can prevent such models from ever arising. For example, if we always used an old version of $M$ that we were confident in for implementing $\mathrm{Amp}(M)$, then we would just need it to be the case that $M$ couldn’t get to the point where it would be able to deceive $\mathrm{Amp}(M)$ in between oversight. As a concrete implementation, imagine some sequence of models $M_0, M_1, \ldots, M_n$ and suppose the minimum number of training steps required for oversight to start failing is $k$.[16] Then, at step $n$ we can implement oversight by calling $\mathrm{Amp}(M_{n-k})$ to binary search on $M_0, \ldots, M_n$, looking for any point at which unacceptable behavior is introduced, and then retrain from right before that point in such a way that fixes the unacceptability.
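The binary-search step above can be sketched as follows. Here `oversight_ok` is a hypothetical stand-in for a call to the amplified older checkpoint, and the search assumes unacceptability persists once introduced (a monotonicity assumption); it uses only $O(\log n)$ oversight calls rather than checking every checkpoint:

```python
def find_first_unacceptable(checkpoints, oversight_ok):
    """Binary search over model checkpoints M_0..M_n for the first index
    flagged by oversight, assuming unacceptability persists once it is
    introduced. Returns len(checkpoints) if every checkpoint passes."""
    lo, hi = 0, len(checkpoints)
    while lo < hi:
        mid = (lo + hi) // 2
        if oversight_ok(checkpoints[mid]):
            lo = mid + 1   # everything up to and including mid looks fine
        else:
            hi = mid       # unacceptability introduced at or before mid
    return lo
```

Retraining would then resume from the checkpoint at index `find_first_unacceptable(...) - 1`, with whatever modification fixes the flagged behavior.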

If we use an implementation of this form, then the difficulty of the recursion problem is likely to depend heavily on the difficulty of the oversight problem: the fewer asks we need to make of $\mathrm{Amp}(M)$ to produce good oversight, and the better our oversight is at detecting problems, the more leeway we have. Thus, though I will primarily be addressing the oversight problem, I anticipate that all of my discussion of the oversight problem will also be highly relevant to the recursion problem.

If $\mathrm{Amp}(M)$ thinks the probability of unacceptable behavior during deployment is less than $\epsilon$, then the probability of unacceptable behavior during deployment must actually be less than $\epsilon$. There are two major components to this condition: $\mathrm{Amp}(M)$ must have a good model of unacceptable behavior, and it must be well-calibrated over whether some model will produce output which is unacceptable according to that model during deployment. For the first component, if we assume that acceptability is a bright enough line that we can train a model to clearly distinguish between acceptable and unacceptable actions, then we can satisfy it. The second component, however—actually being well-calibrated—is somewhat more complicated. Specifically, we need $\mathrm{Amp}(M)$ to be well-calibrated over questions of the form $P(\alpha(x) \mid x \sim \text{deploy})$ and $P(C(M, x) \mid \alpha(x),\ x \sim \text{deploy})$, since those are the questions we’re going to be asking it.

First, $\mathrm{Amp}(M)$ has to be well-calibrated over $P(\alpha(x) \mid x \sim \text{deploy})$. In particular, this requires $\mathrm{Amp}(M)$ to have a good model of the deployment distribution. This seems tough—and possibly problematic if it also requires $M$ to know more about the differences in distributions such that it can more effectively choose when to defect—though in fact it’s not quite as bad as it might seem: it’s fine if $\mathrm{Amp}(M)$ is fairly uncertain about most $\alpha$—it just can’t be so uncertain that it becomes impossible to ever reduce the combined probability below $\mathrm{Amp}(M)$’s prior on unacceptability.

Second, $\mathrm{Amp}(M)$ has to be well-calibrated over $P(C(M, x) \mid \alpha(x),\ x \sim \text{deploy})$. This is even more difficult because $\mathrm{Amp}(M)$ cannot actually sample at all from that distribution, as $\alpha$ could require something like a factorization of RSA-2048. However, there are some reasons to believe that this might still be tractable. First, $\mathrm{Amp}(M)$ doesn’t have to answer arbitrary questions about this distribution—only the probability of unacceptable behavior. And if we assume that a predicate $\alpha$ that $M$ has no knowledge of or way of testing won’t affect the probability of unacceptable behavior, then $\mathrm{Amp}(M)$ only has to care about those $\alpha$ which actually matter to $M$. Furthermore, since $\mathrm{Amp}(M)$ is an amplified version of $M$, it should know about all those $\alpha$ which $M$ knows about, and thus should have a real shot at being able to answer questions conditional on $\alpha$.

If $M$ has probability of producing unacceptable behavior during deployment greater than $\epsilon$, then there must exist some $\alpha$ that will convince $\mathrm{Amp}(M)$ of that fact. Since our procedure can handle arbitrary pseudo-inputs, we can make $X_{\text{pseudo}}$ large enough to contain whatever we might need. Thus, the hard part of this condition is convincing $\mathrm{Amp}(M)$. Assuming away the recursion problem such that $\mathrm{Amp}(M)$ is trying to do the right thing, it still needs to be the case that $M$ has no way of tricking $\mathrm{Amp}(M)$ into believing that it’s acceptable when it in fact isn’t. This condition is likely to depend heavily on how good $\mathrm{Amp}(M)$’s transparency tools are, and in particular how robust they are to attempted deception on the part of $M$.

Importantly, however, $\mathrm{Amp}(M)$ doesn’t have to be confident that $M$ is unacceptable—it just has to be the case that $M$ can’t convince $\mathrm{Amp}(M)$ to be confident that it is acceptable. Thus, since $\mathrm{Amp}(M)$ should be highly uncertain by default—and should stay that way unless the model is highly transparent—this condition should not be too difficult to satisfy. In particular, it needs to be the case that $\mathrm{Amp}(M)$ is only convinced that a pseudo-input is fine if $M$ is sufficiently transparent that $\mathrm{Amp}(M)$ knows it would be able to detect any unacceptable behavior if it existed.

It must not be that hard for $\mathrm{Adv}$ to find such an $\alpha$. In his original formulation, Paul noted that the “most fundamental obstruction is implementing a good adversary.” In this formulation, however, much of the difficulty has been transferred from the adversary to the probability model. For example, under this model it is fine to include an $\alpha$ that is just “an input such that $M$ is unacceptable,” in which case the entirety of the work falls to the probability model. The work still has to happen somewhere, however: either $\alpha$ needs to encode lots of relevant information for the probability model, or $\mathrm{Amp}(M)$ has to be able to discover that information on its own.

Thus, let’s suppose that most of the relevant work is happening in the probability model. In that case, we need $\mathrm{Amp}(M)$ to be able to figure out when $M$ is likely to behave unacceptably. As with the previous condition, this is likely to depend heavily on how transparent $M$ can make itself: there has to be some level of transparency enforceable by $\mathrm{Amp}(M)$ such that only good $M$ can pass without letting $\mathrm{Adv}$ produce an $\alpha$ that convinces $\mathrm{Amp}(M)$. Importantly, however—as with the previous condition—since $\mathrm{Amp}(M)$ is uncertain by default, $\alpha$ doesn’t have to be that convincing: it only has to be good enough to convince $\mathrm{Amp}(M)$ that there is a non-negligible chance of unacceptability. Thus, the key condition here is that any unacceptable model must have some at least moderately plausible failure mode which is not that hard to find. Then, for opaque models, $\mathrm{Adv}$ can do this by coming up with something that isn’t ruled out by the transparent portions, and for transparent models, it should be able to directly determine any failure modes.

There must exist a model $M$ that is not that hard to find such that $\mathrm{Amp}(M)$ is confident that the probability of $M$ producing unacceptable behavior during deployment is less than $\epsilon$. First, note that the first condition gives us that $M$ is actually good if $\mathrm{Amp}(M)$ thinks it is. Thus, this condition is saying that it must not be that hard to find an acceptable $M$ such that $\mathrm{Amp}(M)$ can actually verify $M$’s acceptability. If we assume that it is not that hard to produce only acceptable actions—that doing so doesn’t interfere much with performance, and that an action’s acceptability is not that hard a condition to detect—then there should exist some acceptable $M$ which is not that hard to find.

The hard part, however, is that there has to exist a not-that-hard-to-find $M$ which is not just acceptable but also verifiably acceptable. Not only does $M$ have to be acceptable and not that hard to find, it has to actually have some way of convincing $\mathrm{Amp}(M)$ of that fact. This is likely to depend heavily on how transparent $M$ can make itself to $\mathrm{Amp}(M)$: since we are assuming that $M$ is acceptable, it should actually be trying to make itself transparent, which should help. However, whether its acceptability can be easily verified to a high degree of certainty by $\mathrm{Amp}(M)$ is going to depend on the degree to which it can actually succeed at that task. Since this condition actually requires $\mathrm{Amp}(M)$ to achieve a high degree of certainty, it could be much more difficult than the other conditions: the default of $\mathrm{Amp}(M)$ always falling back on its prior and remaining highly uncertain doesn’t work here. Thus, this condition—that there actually exist a highly transparent, competitive model—seems to be the most critical.

The core prob­lem: transparency

A theme common to many of the conditions for success given above—especially the last, hardest condition—is transparency. If $M$ can make itself transparent enough that $\mathrm{Amp}(M)$ can verify its acceptability, in such a way that doesn’t make $M$ too much harder to find or hurt its performance too much, then—with a few caveats—the above proposal should mostly go through. Alternatively, if no such transparency method exists, then the above approach—at least as presented here—seems to have little chance of working. Thus, we can frame the work presented here as a way of elucidating exactly what sorts of transparency techniques are likely to be necessary to solve the overall alignment problem. In particular, we would like to be able to reduce the burden on transparency by figuring out what the easiest possible transparency techniques are that would fully solve the problem.

One possible starting point for tackling this question of what sorts of transparency techniques are likely to be necessary is Paul’s notion of ascription universality. Ascription universality gives us a way of talking about when we should trust a system, as well as when we should trust one more than another. Specifically, we want it to be the case that $\mathrm{Amp}(M)$ epistemically dominates $M$ such that, regardless of any information we get about the beliefs of either $M$ or $\mathrm{Amp}(M)$, we will always trust $\mathrm{Amp}(M)$’s predictions about all quantities better than our own. This condition is somewhat too strong, however, as we don’t need to trust $\mathrm{Amp}(M)$ on all questions, only those relevant to the oversight process. Those questions which are relevant to the oversight process, however, have the property of primarily being questions about the internal structure of $M$, which makes the condition of knowing strictly more than $M$ about them somewhat difficult to satisfy. Thus, the critical transparency condition that this model suggests we need is for $\mathrm{Amp}(M)$ to know more about $M$ than $M$ does.

Unfortunately, epistemic dominance is a statement about a model’s expectations rather than its internal structure, which means it doesn’t give us a very mechanistic understanding of what a model which satisfies such a condition might look like.[17] However, we can still think about the sorts of transparency tools which might enable us to get to the point where we could verify such a condition. In particular, $\mathrm{Amp}(M)$ is going to need some way of fully understanding $M$. We can isolate three distinct ways in which that can be done:

  1. Transparency via inspection: use transparency tools to understand the model by inspecting the trained model.

  2. Transparency via training: incentivize the model to be as transparent as possible as part of the training process.

  3. Transparency via architecture: structure the model's architecture such that it is inherently more transparent.

Though it is currently unclear which of these approaches is likely to be the most fruitful, it seems likely that some combination of all of them will be necessary. Whatever approach we take, however, one of the major conditions will be for it to scale well to larger, more complicated models so that this whole approach can stay competitive.
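To make the combination concrete, here is a toy sketch (all names, the training task, and the penalty are hypothetical illustrations, not anything from the post) in which the architecture fixes a modular parameter layout, training adds a transparency penalty to the loss, and inspection audits the trained parameters:

```python
# Toy sketch combining the three transparency approaches.
# "Architecture" fixes a named, modular parameter layout; "training" adds a
# transparency penalty to the loss; "inspection" is a stand-in audit function.

def task_loss(params, batch):
    # Hypothetical task: fit y = 2*x using only the designated module.
    return sum((params["world_model"] * x - 2 * x) ** 2 for x in batch) / len(batch)

def opacity_penalty(params):
    # Stand-in "transparency via training" term: penalize any weight
    # leaking into the module that is supposed to stay inert.
    return params["residual"] ** 2

def train(steps=2000, lr=0.01):
    # "Transparency via architecture": parameters are split into named modules.
    params = {"world_model": 0.0, "residual": 1.0}
    batch = [1.0, 2.0, 3.0]
    for _ in range(steps):
        # Finite-difference coordinate descent on task loss + penalty.
        for key in params:
            eps = 1e-5
            bumped = dict(params)
            bumped[key] += eps
            g = ((task_loss(bumped, batch) + opacity_penalty(bumped))
                 - (task_loss(params, batch) + opacity_penalty(params))) / eps
            params[key] -= lr * g
    return params

def inspect(params, tol=0.1):
    # "Transparency via inspection": check that the learned content lives in
    # the designated module and the other module stayed empty.
    return abs(params["world_model"] - 2.0) < tol and abs(params["residual"]) < tol

params = train()
assert inspect(params)
```

The point of the sketch is only the division of labor: the architecture determines where content can live, the penalty pushes content into the designated module, and inspection checks that it actually ended up there.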

Specifically, some of the sorts of transparency guarantees that are likely to be useful in this domain might include:

  • The ability to access the model's model of the world: that is, what facts about the world does the model know that go into producing its output? In particular, what implicit knowledge is the model using that isn't represented in its explicit output?

  • Knowledge of what factors most highly impact the model's decisions. In particular, if there is some condition such that the model only behaves acceptably if that condition is true, then we want to be able to know that the model's output is highly dependent on it.

  • Good understanding of the model's counterfactual behavior. In particular, we want to be able to know broadly what the model will do in situations (such as RSA-2048) where we can't explicitly instantiate an input which can be fed to the model directly.

  • The ability to access the model's objective: that is, if the model is running some search process, what is that search process optimizing for? If the model is searching over possible strategies which it is considering implementing, what criterion is it using to select among them?
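As a minimal illustration of the second guarantee (the model, its inputs, and the condition here are all hypothetical stand-ins), a transparency tool that let us toggle the model's belief in a condition would let us measure how strongly the output depends on it:

```python
# Toy probe for whether a model's output is highly dependent on some internal
# condition, assuming transparency tools let us toggle the model's belief in it.

def model(x, believes_deployed):
    # A stand-in conditionally-defecting model: behaves well unless it
    # believes it has been deployed.
    return x + 1 if not believes_deployed else -x

def dependence_on_condition(model, inputs):
    # Fraction of inputs on which toggling the condition changes the output.
    flips = sum(model(x, False) != model(x, True) for x in inputs)
    return flips / len(inputs)

# This model's output flips on every probed input when the condition toggles,
# flagging the condition as one the overseer needs to scrutinize.
assert dependence_on_condition(model, [0, 1, 2, 3]) == 1.0
```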

Included below are two examples of possible ideas for how a full transparency scheme might work using some of the above tools.

Transparency for agent objectives. First, suppose we accept the conjecture that models implementing highly coherent optimization processes are the most competitive. If this conjecture is true, it should be possible to coherently separate such a model into pieces representing the different components of its optimization process, such as its world model, optimization procedure, and objective.

Inspection, training, and architectural transparency methods could be combined to achieve such a separation. First, architectural transparency could be used to structure the model to have pieces corresponding to the different components of its optimization process—e.g. a separate world-model piece and optimization-procedure piece. Second, inspection transparency could be used to verify that the whole world model was contained in the world-model piece and not in other parts of the model as well—and similarly for the other components, such as the model's objective. Finally, training transparency could be used to incentivize the model to be as transparent according to the inspection transparency process as possible.

If such a scheme were able to work, then the overseer might be able to directly verify that the model's objective was aligned, while for the world model and optimization procedure it would only have to verify that they are in fact value-neutral. Such value-neutrality verification could then, for example, be done via an impact measure such as Attainable Utility Preservation or Relative Reachability, both of which are forms of measuring the degree to which an action advantages one set of values over others.
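A toy sketch of what such value-neutrality verification might look like, with hypothetical utility functions and actions standing in for real ones: an action is value-neutral to the extent that it advantages every utility function equally.

```python
# Toy value-neutrality check in the spirit of attainable-utility-style impact
# measures: compare how much an action advantages each utility function.

def neutrality_gap(state, action, utilities):
    # Spread in how much the action advantages each utility function;
    # zero means the action is perfectly value-neutral.
    deltas = [u(action(state)) - u(state) for u in utilities]
    return max(deltas) - min(deltas)

# Hypothetical values and actions in a tiny resource world.
utilities = [lambda s: s["gold"], lambda s: s["wood"]]
state = {"gold": 1.0, "wood": 1.0}

wait_action = lambda s: s                                       # changes nothing
grab_gold = lambda s: {"gold": s["gold"] + 5, "wood": s["wood"]}  # favors one value

assert neutrality_gap(state, wait_action, utilities) == 0.0  # value-neutral
assert neutrality_gap(state, grab_gold, utilities) == 5.0    # advantages gold
```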

Pure world model transparency. Alternatively, suppose we don't accept the conjecture that a model needs to implement a coherent optimization process to be competitive. In this world, we have another option available to us: rather than having to ensure we get a transparent objective, we can just train a model that has no objective—that is, we can try to train a purely predictive model with only a world model but no optimization procedure or objective.

To do this, we could use architectural transparency to structure the model to be composed of only a world-model portion trained in a purely generative setting, and use inspection transparency to verify that it is in fact only a world model and hasn't learned any sort of optimization procedure instead. We could also again use training transparency to incentivize the model to be as transparent to the inspection process as possible.

In this situation, the training and amplification procedures become somewhat more complicated, as we still want to be able to extract actions out of something that is just a world model. To do this, we could implement amplification via a human with access to sufficient inspection transparency tools to enable them to make use of the model's world model to guide their decision-making, though the obvious concern here is whether this would actually be sufficient to stay competitive with other approaches.
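A minimal sketch of this setup, with hypothetical dynamics and evaluation function: the world model is purely predictive, and the objective lives entirely in the decision-maker that queries it.

```python
# Toy sketch of extracting actions from a pure world model: a decision-maker
# (standing in for the human plus transparency tools) queries the predictive
# model for the consequence of each candidate action and picks the best one.

def world_model(state, action):
    # Purely predictive: maps (state, action) to a predicted next state,
    # with no objective or optimization procedure of its own.
    return state + action

def choose_action(state, candidate_actions, evaluate):
    # The objective lives entirely in the decision-maker's evaluation
    # function, not in the model being queried.
    return max(candidate_actions, key=lambda a: evaluate(world_model(state, a)))

# Hypothetical goal: the decision-maker wants the state closest to 10.
best = choose_action(state=3, candidate_actions=[-1, 2, 5, 8],
                     evaluate=lambda s: -abs(s - 10))
assert best == 8  # predicted next state 3 + 8 = 11 is closest to 10
```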

Alternatively, even if we did need our model to implement some sort of optimization procedure to be competitive, extracting just the world model might still be quite good, as it could be enough to get an iid guarantee. Specifically, if we were able to understand the model's beliefs about the differences between the training and deployment distributions, we might be able to only deploy the model in situations where we know it can't tell it's being deployed, hopefully giving us a good iid guarantee. See "Make use of the adversary's model" under "Other approaches" for a further treatment of this possibility.

Finally, a model's world model seems to be one of the easiest pieces to extract, as it can naturally be identified by looking for the portions of the model with the greatest mutual information with the world. Thus, one good first step for making either of the above transparency schemes go through—since both of them involve world-model extraction—might be to gain a better understanding of what transparent world models look like and how they can be extracted.[18]
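A toy sketch of the mutual-information heuristic, using hypothetical discrete activations in place of real model internals:

```python
import math
from collections import Counter

# Toy world-model identification via mutual information: the portion of the
# model whose activations share the most information with the world state is
# the natural world-model candidate.

def mutual_information(xs, ys):
    # Standard discrete mutual information, in bits, from paired samples.
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

world = [0, 1, 0, 1, 1, 0, 1, 0]        # hypothetical world-state samples
portion_a = world[:]                     # activations tracking the world exactly
portion_b = [0, 0, 1, 1, 0, 0, 1, 1]     # activations unrelated to the world

# Portion A carries 1 bit about the world; portion B carries none.
assert mutual_information(portion_a, world) > mutual_information(portion_b, world)
```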

Other approaches

This section is not meant to be a complete list of possible alternative strategies for solving this problem, but rather an analysis of some other promising directions that were looked into as part of the research that went into this post.

Call the overseer directly. As I noted previously, one possible adversary could simply be the amplified overseer itself, in which case the adversarial setup reduces to the overseer's direct assessment of the model's acceptability. Thus, a reasonable question to ask is whether we should simply use this as our acceptability penalty directly—that is, how much do we gain from having an adversary? There are two reasons we might want to actually have an adversary.

First, the use of an adversary might allow us to inject additional useful information into the system. Specifically, we are giving the system the following information: the fact that it should be using an adversarial training setup, and the specific pseudo-input set that it should be considering. In particular, our ability to specify the pseudo-input set gives us the ability to constrain the adversary such that it is forced to confine its demonstration to some structure that enables it to be easily checked by the overseer. In practice, however, we might want to implement this simply by letting amplification decompose the original question into the adversarial setting itself, but even in that case it's still worth analyzing why or why not amplification might want to adopt that particular decomposition.

Second, the decomposition into adversary and overseer might significantly aid training, as it allows us to train the adversary on the reward signal of maximizing the overseer's assessed probability of unacceptability rather than just relying on the reward signal from amplification. Furthermore, if we're not using a standard amplification setup, as in the pure world model transparency case described previously, then we might need this additional training signal, as we wouldn't necessarily get it for free out of decompositions anymore in that regime.

Implement the adversary as a separate system. Another way in which the use of an adversary could be made to significantly improve upon just calling the overseer directly is to have the adversary implemented as a separate system. One major problem with such a setup, however, is that we then need another system for guaranteeing the acceptability of the adversary. If we restrict the adversary's task sufficiently such that its input data isn't rich enough for it to learn complex deceptive behavior, however, then producing such an acceptability guarantee might be sufficiently easier than the original problem that it provides an actual reduction. For example, we could train the adversary to be solely a theorem prover that produces proofs that the model is unacceptable under some set of assumptions, where we incentivize the adversary to select its assumptions so as to maximize the overseer's assessed probability that those assumptions hold.

Make use of the adversary's model. One assumption we are relying on in this setup is that the adversary has a model of the model being overseen. It is interesting to explore what else one can do with other versions of that assumption. In particular, if one assumes that the overseer can access the model's models of the training and deployment distributions, then it might be possible to turn that into an iid guarantee. One simple way of doing this might be to only deploy the model in situations which have high probability according to its model of the training distribution. Alternatively, such a condition could even be integrated into the training process if the model is trained to match its model of the deployment distribution to its model of the training distribution, such that it is trained to "forget" about any distributional shift it might otherwise know about. Importantly, however, approaches of this form can only produce an iid guarantee, which must be combined with some analysis of the form presented previously to turn it into an acceptability guarantee.
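A minimal sketch of the first idea, with a hypothetical distribution standing in for the model's model of the training distribution: the model only acts on inputs it believes are likely under training, abstaining everywhere else.

```python
# Toy deployment gate for an iid guarantee: only let the model act on inputs
# it assigns high training-distribution probability.

train_probs = {"a": 0.5, "b": 0.45, "rare": 0.05}  # model's model of training

def gated_deploy(x, act, threshold=0.1):
    # Abstain on inputs the model believes are unlikely under training, since
    # those are exactly where distribution-triggered unacceptability could hide.
    if train_probs.get(x, 0.0) < threshold:
        return "abstain"
    return act(x)

act = lambda x: f"handled {x}"
assert gated_deploy("a", act) == "handled a"
assert gated_deploy("rare", act) == "abstain"
assert gated_deploy("never_seen", act) == "abstain"
```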

Ensemble over many models. Another approach which is not explored in this post is the use of ensembling to improve acceptability. If ensemble consensus is required to produce an output, then as long as there is some acceptable model in the ensemble, its outputs should always be guaranteed to be acceptable. Thus, ensembling can reduce the burden of having to produce an acceptable model precisely, in exchange for the cost of spending additional compute training additional models. Importantly, however, this only works if the different models are sufficiently distinct that the probability of any one of them being good is relatively independent of the probability of the others being good, which might be a difficult condition to satisfy in practice. Furthermore, ensembling is less a solution in and of itself and more a strategy we can use to improve upon other acceptability guarantees, as we still need to actually have some argument for why we will be likely enough to find an acceptable model in the first place that ensembling can get us the rest of the way there.
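A toy sketch of the consensus requirement, with hypothetical member models: as long as one acceptable member remains in the ensemble, any defection breaks consensus and the ensemble abstains rather than acting unacceptably.

```python
# Toy unanimity ensemble: an output is produced only when every member agrees,
# so a single acceptable member can veto any unacceptable action.

def consensus(models, x):
    outputs = {m(x) for m in models}
    return outputs.pop() if len(outputs) == 1 else "abstain"

# Hypothetical members: two honest models and one conditional defector.
honest = lambda x: x * 2
also_honest = lambda x: x * 2
defector = lambda x: x * 2 if x < 100 else -1  # defects on rare inputs

assert consensus([honest, also_honest, defector], 3) == 6          # agreement
assert consensus([honest, also_honest, defector], 1000) == "abstain"  # veto
```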


Though this post goes most in-depth into the particular proposal of relaxed adversarial training, it is my hope that many of the ideas that have been discussed here will be broadly useful for advancing our understanding of different possible approaches to solving inner alignment in general. In particular, I believe that one of the best ways in which we can get traction now on how to solve hard AGI alignment problems such as the ones discussed here is to analyze concrete schemes and hard cases so that we can get an idea now of what sorts of problems we're likely to encounter in the future on the road to AGI.

To that end, I believe that one of the most important takeaways we can draw from the analysis presented here, regardless of what sort of approach we actually end up using, is the central importance of transparency. Without being able to look inside our model to a significant degree, it is likely going to be very difficult to get any sort of meaningful acceptability guarantees. Even if we are only shooting for an iid guarantee, rather than a worst-case guarantee, we are still going to need some way of looking inside our model to verify that it doesn't fall into any of the other hard cases. One of the ways in which this sort of work can be most helpful, therefore, is in providing research directions for transparency: what is it that we actually need to get out of transparency to be able to produce a good acceptability guarantee?

This work suggests, for example, that one of the critical conditions we need is for our transparency scheme to be good enough that there actually exists some model which can demonstrate to an overseer that it is acceptable. Thus, we need to answer the question of what would be sufficient to convince us (or at least an amplified version of us) that a model is acceptable. One possible first step towards something of this form might be working on world-model extraction as described previously, though significant future work—likely involving major advances in our understanding of interpretable world models—still seems necessary to make anything of that form go through.

  1. For a further treatment of the question of what a good acceptability guarantee might look like, see "Towards a mechanistic understanding of corrigibility." ↩︎

  2. See the "Hard cases" section for an analysis of ways in which a guarantee of this form might be insufficient and/or difficult to obtain. ↩︎

  3. Note that there are still a lot of unresolved open questions regarding inner alignment. "Risks from Learned Optimization" attempts to address many of these questions from a theoretical perspective, though additional experimental evidence would also be useful to further pin down this problem. ↩︎

  4. See the "Conditional defection" hard case for more examples of inputs that might lead to unacceptable behavior which might be very difficult to instantiate during training. ↩︎

  5. This could be pure supervised amplification, debate, or anything in between. For simplicity, I mostly think about the pure supervised case wherein amplification is implemented simply by asking the question to a human with the ability to consult the model. ↩︎

  6. Note that this still assumes acceptability is binary—which is not quite right, since we might care about some unacceptable actions more than others—though that can also easily be relaxed by moving to an expectation rather than a probability. ↩︎

  7. Note that this approximation is tight if and only if there exists some such that . ↩︎

  8. One concern with this definition is that probabilities are not in general well-defined over logical statements. To resolve this problem, we can instead ask something of the form "What is your break-even price for a contract that pays out $1 if and are true but that is fully refunded if is false?" ↩︎

  9. Alternatively, if we want to think of deployment as a concrete set of inputs rather than a distribution, we can set . ↩︎

  10. Note that this is a conservative prior. In practice, we might hope for random unacceptable behavior to be unlikely enough and difficult enough that we can get a significant advantage out of our prior rather than simply assuming uniformity. ↩︎

  11. One appealing idea for how we might be able to satisfy it is to use some sort of online learning scheme. This doesn't actually solve the problem, however—the difference between training and deployment here is just that unacceptable behavior during training is recoverable but unacceptable behavior during deployment is not. Even if we do online learning, the system will presumably still have to be deployed in situations where it will be able to take actions that would be unrecoverable if unacceptable. However, online learning might help us increase the number of unacceptable actions required before the situation becomes unrecoverable, which is exactly the fix analyzed here. ↩︎

  12. Specifically, the experiments were conducted by using a black-box optimizer to select so as to maximize . ↩︎

  13. The trends given here are highly approximate. Significantly better approximations are given by and . ↩︎

  14. Why you would ever actually get a model of this form is a complicated question; for a complete treatment, see the section on deceptive alignment in "Risks from Learned Optimization." ↩︎

  15. In particular, it might be that unacceptable behavior is iid relative to the training distribution, but still becomes more likely when conditioning on the situation being important to us. ↩︎

  16. Note that the bound here is not the minimum number of steps before the model can become unacceptable, but rather the minimum number of steps before the model can deceive the overseer into thinking it's acceptable when it's not. ↩︎

  17. Though no more mechanistic approach currently exists, I am excited about future research possibly yielding progress in this domain. ↩︎

  18. Much research in the direction of extracting interpretable world models already exists—such as Carter et al.'s Activation Atlases, for example—though it is currently unclear to what extent such techniques will scale to more powerful AI systems. ↩︎