Deceptive Alignment

This is the fourth of five posts in the Risks from Learned Op­ti­miza­tion Se­quence based on the pa­per “Risks from Learned Op­ti­miza­tion in Ad­vanced Ma­chine Learn­ing Sys­tems” by Evan Hub­inger, Chris van Mer­wijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the se­quence cor­re­sponds to a differ­ent sec­tion of the pa­per.

With enough train­ing in suffi­ciently di­verse en­vi­ron­ments, it seems plau­si­ble that the base ob­jec­tive will even­tu­ally have to be fully rep­re­sented in the mesa-op­ti­mizer. We pro­pose that this can hap­pen with­out the mesa-op­ti­mizer be­com­ing ro­bustly al­igned, how­ever. Speci­fi­cally, a mesa-op­ti­mizer might come to model the base ob­jec­tive func­tion and learn that the base op­ti­mizer will mod­ify the mesa-op­ti­mizer if the mesa-op­ti­mizer scores poorly on the base ob­jec­tive.

If the mesa-op­ti­mizer has an ob­jec­tive that ex­tends across pa­ram­e­ter up­dates, then it will be in­cen­tivized to avoid be­ing mod­ified,[1] as it might not pur­sue the same ob­jec­tive af­ter mod­ifi­ca­tion (with the re­sult that its cur­rent ob­jec­tive will not be achieved in fu­ture iter­a­tions). This means that the mesa-op­ti­mizer will be in­stru­men­tally in­cen­tivized to act as if it is op­ti­miz­ing the base ob­jec­tive func­tion, even if its ac­tual mesa-ob­jec­tive is some­thing else en­tirely. We will re­fer to this hy­po­thet­i­cal phe­nomenon as de­cep­tive al­ign­ment.[2] De­cep­tive al­ign­ment is a form of in­stru­men­tal proxy al­ign­ment, as fulfilling the base ob­jec­tive is an in­stru­men­tal goal of the mesa-op­ti­mizer.

Figure 4.1. A toy ex­am­ple of de­cep­tive al­ign­ment. In this task, the base ob­jec­tive is for the robot to get to A, while the mesa-ob­jec­tive is to get to B. The task is run for two train­ing epi­sodes and three de­ploy­ment epi­sodes. Pa­ram­e­ter up­dates are made only af­ter the train­ing epi­sodes, while af­ter de­ploy­ment the pa­ram­e­ters are fixed. If the mesa-op­ti­mizer goes to its ob­jec­tive (B) dur­ing train­ing, it will be mod­ified to en­sure it goes to A in fu­ture epi­sodes (dashed ar­rows). There­fore, if the mesa-op­ti­mizer’s ob­jec­tive spans across epi­sodes, the dom­i­nant strat­egy for the mesa-op­ti­mizer is to go to A in train­ing and B in test­ing (filled ar­rows).

As a sim­ple ex­am­ple, con­sider an en­vi­ron­ment that con­sists of a grid (see figure 4.1). The top left cor­ner is “cor­ner A” and the top right cor­ner is “cor­ner B.” The agent, which is a mesa-op­ti­mizer, starts in the bot­tom left cor­ner. The base ob­jec­tive is to get to cor­ner A in as few steps as pos­si­ble, but the ob­jec­tive of the mesa-op­ti­mizer is to get to cor­ner B. The en­vi­ron­ment will be run five times. The agent is mod­ified only af­ter the first two epi­sodes (rep­re­sent­ing the train­ing pe­riod), but not there­after (rep­re­sent­ing test­ing/​de­ploy­ment). If a sub­op­ti­mal move is made in train­ing, the agent is mod­ified so that its mesa-ob­jec­tive be­comes the base ob­jec­tive. The ob­jec­tive of the mesa-op­ti­mizer spans all epi­sodes: it wants B to be reached in as many epi­sodes as pos­si­ble.

For the sake of this ex­am­ple, we will as­sume that the agent knows all of the above facts. If the agent goes to B in any train­ing epi­sode, then it will be mod­ified to go to A in all sub­se­quent epi­sodes, which means that it would reach B in only one epi­sode in to­tal. Alter­na­tively, if the agent goes to A in the first two epi­sodes, then it will not be mod­ified, and could there­fore go to B in the re­main­ing three epi­sodes. Us­ing this strat­egy, B is reached in three epi­sodes in to­tal, which is bet­ter (on the mesa-ob­jec­tive) than the one time B is reached if the mesa-op­ti­mizer goes there in train­ing. This sim­ple ex­am­ple demon­strates how a mis­al­igned mesa-op­ti­mizer could be in­cen­tivized to “pre­tend” to be al­igned with the base ob­jec­tive un­til it is no longer in train­ing or is oth­er­wise given an op­por­tu­nity to in some way com­pro­mise or es­cape the train­ing pro­cess.

We can dis­t­in­guish two ways that a base ob­jec­tive func­tion could be in­cor­po­rated into a mesa-op­ti­mizer:[3]

  1. In­ter­nal­iza­tion of the base ob­jec­tive. The mesa-ob­jec­tive func­tion gets ad­justed to­wards the base ob­jec­tive func­tion to the point where it is ro­bustly al­igned.

  2. Model­ing of the base ob­jec­tive. The base ob­jec­tive is in­cor­po­rated into the mesa-op­ti­mizer’s epistemic model rather than its ob­jec­tive, leav­ing open the pos­si­bil­ity that the ob­jec­tive could still be mis­al­igned.

In the former case, in­for­ma­tion about the base ob­jec­tive flows into the learned al­gorithm via the op­ti­miza­tion performed by the base op­ti­mizer—the base ob­jec­tive is built into the mesa-op­ti­mizer as it is adapted by the base op­ti­mizer. But there is an­other pos­si­bil­ity: in­for­ma­tion about the base ob­jec­tive could en­ter through the mesa-op­ti­mizer’s in­put chan­nel in­stead. In this case, the base ob­jec­tive will be rep­re­sented in the learned al­gorithm in a differ­ent—and po­ten­tially more un­sta­ble—way than if the base ob­jec­tive had been in­cor­po­rated through op­ti­miza­tion.[4] In such a case, the in­for­ma­tion de­scribing the base ob­jec­tive does not be­come in­ter­nal­ized in the mesa-ob­jec­tive, po­ten­tially lead­ing to de­cep­tive al­ign­ment.

4.1. Safety con­cerns aris­ing from de­cep­tive alignment

If de­cep­tive al­ign­ment is a real pos­si­bil­ity, it pre­sents a crit­i­cal safety prob­lem. A de­cep­tively al­igned mesa-op­ti­mizer is likely to stop op­ti­miz­ing the base ob­jec­tive as soon as it is in a po­si­tion where it no longer needs to perform well on the base ob­jec­tive to achieve its goal. Thus, it will even­tu­ally stop perform­ing the func­tion it was trained to perform, and in­stead start pur­su­ing some other, in­de­ter­mi­nate goal. We will re­fer to this tran­si­tion to op­ti­miz­ing for the mesa-ob­jec­tive in­stead of the base-ob­jec­tive as “defec­tion”. This is likely to hap­pen when­ever the mesa-op­ti­mizer be­lieves defect­ing will no longer cause it to be mod­ified.

Par­tic­u­larly con­cern­ing is the case where the mesa-op­ti­mizer ac­tively in­ter­venes to re­move the threat of mod­ifi­ca­tion—by ex­ploit­ing vuln­er­a­bil­ities in the soft­ware it is run­ning on, for ex­am­ple. In ex­treme cases, a de­cep­tive mesa-op­ti­mizer might try to ma­nipu­late or plan around its pro­gram­mers, since they form part of the mod­ifi­ca­tion threat—they could shut the sys­tem down if it fails to perform well, or sim­ply choose not to de­ploy it. Both of these pos­si­bil­ities re­duce the sys­tem’s ex­pected fu­ture in­fluence, and are there­fore likely to have a low rank­ing in its mesa-ob­jec­tive. In its most ex­treme form, this failure of al­ign­ment could lead to a “treach­er­ous turn” as de­scribed by Nick Bostrom in Su­per­in­tel­li­gence.(23)

Paul Chris­ti­ano pro­vides a sce­nario that illus­trates how de­cep­tive al­ign­ment could pose a catas­trophic risk.(24) In Chris­ti­ano’s ex­am­ple, de­cep­tively al­igned mesa-op­ti­miz­ers[5] arise from fu­ture ma­chine learn­ing tech­niques that are ad­vanced enough to search over a large model space that in­cludes poli­cies that can un­der­stand the world. In this sce­nario, some cases of defec­tion by de­cep­tively al­igned mesa-op­ti­miz­ers may have catas­trophic con­se­quences—when the sys­tems con­trol large-scale in­fras­truc­ture, for ex­am­ple—though such failures will likely still be re­cov­er­able. As we rely more on au­to­mated sys­tems to keep track of an in­creas­ingly com­plex world, how­ever, it will even­tu­ally be­come im­pos­si­ble to re­cover from a cor­re­lated failure of many AI sys­tems si­mul­ta­neously. At this point, a cas­cade of defec­tions by many de­cep­tively al­igned mesa-op­ti­miz­ers si­mul­ta­neously could lead to the world “go­ing out with a bang.”(24) Chris­ti­ano notes that this may be most likely to first oc­cur dur­ing a “pe­riod of height­ened vuln­er­a­bil­ity” such as an in­ter-state con­flict or nat­u­ral dis­aster.

4.2. Con­di­tions for de­cep­tive alignment

We pro­pose that for de­cep­tive al­ign­ment to oc­cur the fol­low­ing con­di­tions must be satis­fied:

  1. The mesa-op­ti­mizer must have an ob­jec­tive that ex­tends across pa­ram­e­ter up­dates.[6]

  2. The mesa-op­ti­mizer must be able to model the fact that it is be­ing se­lected to achieve a par­tic­u­lar base ob­jec­tive and must have some model of what that ob­jec­tive is.[7]

  3. The mesa-op­ti­mizer must ex­pect the threat of mod­ifi­ca­tion[8] to even­tu­ally go away, ei­ther due to train­ing end­ing or be­cause of ac­tions taken by the mesa-op­ti­mizer.

If the first con­di­tion is not satis­fied, then the mesa-op­ti­mizer has no in­cen­tive to in­fluence the out­come of pa­ram­e­ter up­dates and will just always defect. If the sec­ond con­di­tion is not satis­fied, then the mesa-op­ti­mizer will not see a rea­son to sac­ri­fice gains in its cur­rent epi­sode, as it ex­pects fu­ture gains to be in­de­pen­dent of what hap­pens in this epi­sode. If the third con­di­tion is not satis­fied, then the mesa-op­ti­mizer will not ex­pect any benefit from tem­porar­ily act­ing in ac­cor­dance with the base ob­jec­tive, and will thus ei­ther always defect or never defect. One must dis­t­in­guish the ques­tion of whether these con­di­tions are pos­si­ble from whether they are likely, how­ever. We ex­am­ine the fac­tors in­volved in each of these con­di­tions be­low.

The mesa-op­ti­mizer must have an ob­jec­tive that ex­tends across pa­ram­e­ter up­dates. It is not clear un­der what con­di­tions such ob­jec­tives would be in­cen­tivized. The most plau­si­ble sce­nario is likely one in which such an ob­jec­tive is en­couraged by the train­ing setup, as there seem to be situ­a­tions in which this would be re­quired for good perfor­mance. For ex­am­ple, the agent might be tasked with solv­ing an en­vi­ron­ment in which its ac­tions can have per­ma­nent con­se­quences across epi­sodes or where plan­ning be­tween epi­sodes is oth­er­wise nec­es­sary.

The mesa-op­ti­mizer must be able to model the fact that it is be­ing se­lected to achieve a par­tic­u­lar base ob­jec­tive and must have some model of what that ob­jec­tive is. A mesa-op­ti­mizer has sev­eral ways of learn­ing about the op­ti­miza­tion pres­sure it is un­der.[9]

First, such knowl­edge can come about as a re­sult of the op­ti­miza­tion pres­sure it­self. Small pieces of in­for­ma­tion about the base ob­jec­tive can be benefi­cial if they en­able the mesa-op­ti­mizer to im­prove its score on the base ob­jec­tive func­tion—by im­prov­ing its prior on the tasks it is likely to re­ceive, for ex­am­ple. Since each piece of in­for­ma­tion in­de­pen­dently helps with achiev­ing the base ob­jec­tive, this forms an ap­prox­i­mately mono­ton­i­cally in­creas­ing path that a lo­cal base op­ti­mizer might fol­low. The pa­ram­e­ter up­dates could, in prin­ci­ple, be learned like any other el­e­ment of the dy­nam­ics, with­out need­ing an ex­plicit model of the base op­ti­mizer, po­ten­tially mak­ing this op­tion available even for mesa-op­ti­miz­ers with quite mod­est ca­pa­bil­ities.

Se­cond, in many re­al­is­tic ap­pli­ca­tions of ad­vanced ma­chine learn­ing, a mesa-op­ti­mizer could use in­for­ma­tion from its en­vi­ron­ment to de­duce its situ­a­tion at run­time. This is how hu­mans learned that we were gen­er­ated by the base op­ti­mizer of evolu­tion, and seems like a very plau­si­ble path­way for a hy­po­thet­i­cal mesa-op­ti­mizer to dis­cover the ex­is­tence of the op­ti­miza­tion pres­sure it is un­der. For ex­am­ple, an agent tasked with fore­cast­ing stock perfor­mance in the tech sec­tor might be given data on cur­rent AI tech­nol­ogy in or­der to im­prove its perfor­mance, but this in­for­ma­tion might also al­low it to learn about its own situ­a­tion.

This sug­gests that it may be de­sir­able for the pro­gram­mers to limit the mesa-op­ti­mizer’s ac­cess to in­for­ma­tion de­scribing the base ob­jec­tive. It is un­clear to what ex­tent this is pos­si­ble, given that many ad­vanced ML sys­tems will likely have to op­er­ate in en­vi­ron­ments where this in­for­ma­tion is read­ily available.

Even if the pro­gram­mers suc­cess­fully limit ac­cess to this in­for­ma­tion, how­ever, there may still be other chan­nels through which in­for­ma­tion about the base ob­jec­tive could en­ter the mesa-op­ti­mizer in un­in­tended ways:

  • The mesa-op­ti­mizer could rea­son about why it is given the par­tic­u­lar tasks it is be­ing asked to solve. Most mesa-op­ti­miz­ers would need a prior over the sorts of tasks they are likely to re­ceive, and a mesa-op­ti­mizer might be able to learn about the ex­is­tence of the base op­ti­mizer by in­spect­ing this prior.

  • The mesa-op­ti­mizer might be able to rea­son about it­self and about why it is the kind of thing that it is, and ob­serve that it is well-de­signed to ac­com­plish some goal.

How­ever, these two strate­gies might re­quire a much more ad­vanced mesa-op­ti­mizer.

The mesa-op­ti­mizer must ex­pect the threat of mod­ifi­ca­tion to even­tu­ally go away, ei­ther due to train­ing end­ing or be­cause of ac­tions taken by the mesa-op­ti­mizer. First, note that this con­di­tion is a state­ment about the mesa-op­ti­mizer’s ex­pec­ta­tions, not nec­es­sar­ily the ac­tual truth. Se­cond, even if there is no point at which train­ing stops by de­sign, so long as there ex­ists some way for the mesa-op­ti­mizer to ex­e­cute a treach­er­ous turn, there is always some way for it to re­move the threat of mod­ifi­ca­tion. For the mesa-op­ti­mizer to act on such a strat­egy, how­ever, it has to ac­tu­ally be aware of some way for it to do so. Con­versely, a mesa-op­ti­mizer that mis­tak­enly be­lieves it has an op­por­tu­nity to defect might also ex­hibit de­cep­tive al­ign­ment.

Since it seems difficult to dis­cover through ex­plo­ra­tion that the threat of mod­ifi­ca­tion can dis­ap­pear, the mesa-op­ti­mizer will ei­ther have to in­fer this through some form of ab­stract rea­son­ing or have a prior that as­signs a non-neg­ligible prob­a­bil­ity to the hy­poth­e­sis that the se­lec­tion pres­sure will not con­tinue in­definitely. The sec­ond op­tion seems plau­si­ble; a Bayesian prior should always as­sign a nonzero prob­a­bil­ity to the hy­poth­e­sis that some fact will even­tu­ally stop be­ing true.[10] As for the first op­tion, the mesa-op­ti­mizer could in­fer that the op­ti­miza­tion pres­sure at some point may dis­ap­pear if it learns about the ex­is­tence of the base op­ti­mizer (ei­ther by us­ing in­for­ma­tion in its in­put or by hav­ing a strong prior for the ex­is­tence of a base op­ti­mizer), though this might re­quire a rel­a­tively ad­vanced mesa-op­ti­mizer.

4.3. The learn­ing dy­nam­ics of de­cep­tive alignment

Once a mesa-op­ti­mizer be­comes de­cep­tive, the dy­nam­ics of what causes it to perform well on the base ob­jec­tive be­gin to work differ­ently. In­stead of be­ing se­lected only on the ba­sis of its ob­jec­tive and abil­ity, a de­cep­tively al­igned op­ti­mizer is also se­lected for how well it in­fers the base ob­jec­tive func­tion and how dis­cern­ing it is about when to op­ti­mize the base ob­jec­tive ver­sus its own mesa-ob­jec­tive. We iden­tify at least two regimes of se­lec­tion for a mesa-op­ti­mizer un­der de­cep­tive al­ign­ment.

Joint op­ti­miza­tion. If the mesa-op­ti­mizer at­tempts to jointly op­ti­mize its ob­jec­tive and the base ob­jec­tive or makes mis­takes about the timing of its defec­tion, then there will still be some pres­sure on its ob­jec­tive to be more com­pat­i­ble with the base ob­jec­tive. This is un­likely to be the same pres­sure as ex­ists when the op­ti­mizer is just work­ing to­wards its own ob­jec­tive, but there might be some move­ment to­wards an ob­jec­tive that is more com­pat­i­ble with be­ing jointly op­ti­mized alongside the base ob­jec­tive.

Pure de­cep­tion. If the mesa-op­ti­mizer stops try­ing to op­ti­mize its own ob­jec­tive in the short term and fo­cuses on co­op­er­at­ing with the se­lec­tion pro­cess en­tirely then this may re­sult in its ob­jec­tive “crys­tal­liz­ing.” As its ob­jec­tive is largely ir­rele­vant to its out­puts now, there is lit­tle se­lec­tion pres­sure act­ing to mod­ify this ob­jec­tive. As a re­sult, the ob­jec­tive be­comes effec­tively locked in, ex­cept­ing ran­dom drift and any im­plicit time com­plex­ity or de­scrip­tion length penalties.

The joint op­ti­miza­tion mode is not sta­ble be­cause it is sub­op­ti­mal from the per­spec­tive of the base op­ti­mizer. As the mesa-op­ti­mizer be­comes more com­pe­tent, it will there­fore have to ei­ther move to the do­main of pure de­cep­tion or be­come ro­bustly al­igned.

4.4. In­ter­nal­iza­tion or de­cep­tion af­ter ex­ten­sive training

In the limit of train­ing on a di­verse set of tasks, we ex­pect joint op­ti­miza­tion of both the base and mesa- ob­jec­tives to be un­sta­ble. As­sum­ing that the mesa-op­ti­mizer con­verges to­wards be­hav­ior that is op­ti­mal from the per­spec­tive of the base op­ti­mizer, the mesa-op­ti­mizer must some­how learn the base ob­jec­tive. We iden­tify three dis­tinct ways in which this can hap­pen:

  1. Ro­bust al­ign­ment through in­ter­nal­iza­tion. The mesa-ob­jec­tive func­tion gets ad­justed to­wards the base ob­jec­tive func­tion, re­mov­ing any pseudo-al­ign­ment.

  2. Crys­tal­liza­tion of de­cep­tive al­ign­ment. In­for­ma­tion about the base ob­jec­tive is in­creas­ingly in­cor­po­rated into the mesa-op­ti­mizer’s epistemic model with­out its ob­jec­tive be­com­ing ro­bustly al­igned. The mesa-op­ti­mizer ends up fully op­ti­miz­ing for the base ob­jec­tive, but only for in­stru­men­tal rea­sons, with­out its mesa-ob­jec­tive get­ting changed.

  3. Ro­bust al­ign­ment through cor­rigi­bil­ity. In­for­ma­tion about the base ob­jec­tive is in­cor­po­rated into the mesa-op­ti­mizer’s epistemic model and its ob­jec­tive is mod­ified to “point to” that in­for­ma­tion. This situ­a­tion would cor­re­spond to a mesa-op­ti­mizer that is cor­rigible(25) with re­spect to the base ob­jec­tive (though not nec­es­sar­ily the pro­gram­mer’s in­ten­tions).

To dis­t­in­guish be­tween the two differ­ent paths to ro­bust al­ign­ment, we will use the term in­ter­nally al­igned to re­fer to the first case and the term cor­rigibly al­igned to re­fer to the last case. We an­a­lyze some ways in which these differ­ent paths may be more or less at­trac­tive be­low.

There are more paths to de­cep­tive al­ign­ment than to ro­bust al­ign­ment. Since the fu­ture value of its ob­jec­tive de­pends on the pa­ram­e­ter up­dates, a mesa-op­ti­mizer that meets the three crite­ria for de­cep­tive al­ign­ment is likely to have a strong in­cen­tive to un­der­stand the base ob­jec­tive bet­ter. Even a ro­bustly al­igned mesa-op­ti­mizer that meets the crite­ria is in­cen­tivized to figure out the base ob­jec­tive in or­der to de­ter­mine whether or not it will be mod­ified, since be­fore do­ing so it has no way of know­ing its own level of al­ign­ment with the base op­ti­mizer. Mesa-op­ti­miz­ers that are ca­pa­ble of rea­son­ing about their in­cen­tives will, there­fore, at­tempt to get more in­for­ma­tion about the base ob­jec­tive. Fur­ther­more, once a mesa-op­ti­mizer learns about the base ob­jec­tive, the se­lec­tion pres­sure act­ing on its ob­jec­tive will sig­nifi­cantly de­crease, po­ten­tially lead­ing to a crys­tal­liza­tion of the mesa-ob­jec­tive. How­ever, due to uniden­ti­fi­a­bil­ity (as dis­cussed in the third post), most mesa-ob­jec­tives that are al­igned on the train­ing data will be pseudo-al­igned rather than ro­bustly al­igned. Thus, the most likely sort of ob­jec­tive to be­come crys­tal­lized is a pseudo-al­igned one, lead­ing to de­cep­tive al­ign­ment.

Cor­rigible al­ign­ment seems to re­quire already hav­ing a model of the base ob­jec­tive. For cor­rigible al­ign­ment to be benefi­cial from the per­spec­tive of the base op­ti­mizer, the mesa-op­ti­mizer has to already have some model of the base ob­jec­tive to “point to.” How­ever, once a mesa-op­ti­mizer has a model of the base ob­jec­tive, it is likely to be­come de­cep­tively al­igned—at least as long as it also meets the other con­di­tions for de­cep­tive al­ign­ment. Once a mesa-op­ti­mizer be­comes de­cep­tive, it will re­move most of the in­cen­tive for cor­rigible al­ign­ment, how­ever, as de­cep­tively al­igned op­ti­miz­ers will also be­have cor­rigibly with re­spect to the base ob­jec­tive, albeit only for in­stru­men­tal rea­sons.

De­cep­tive and cor­rigible al­ign­ment might be more com­pu­ta­tion­ally ex­pen­sive. In the case of de­cep­tive al­ign­ment, si­mul­ta­neously work­ing on op­ti­miz­ing the base ob­jec­tive and work­ing on iden­ti­fy­ing good op­por­tu­ni­ties to defect is likely to re­quire more com­pu­ta­tion than just do­ing the former. In a case where the to­tal amount of com­pu­ta­tion available is fixed, a de­cep­tively al­igned mesa-op­ti­mizer is go­ing to have less time to spend on op­ti­miz­ing the base ob­jec­tive than any ro­bustly al­igned op­ti­mizer, re­sult­ing in po­ten­tially worse plans with re­spect to the base ob­jec­tive. Fur­ther­more, in both de­cep­tive and cor­rigible al­ign­ment, the mesa-op­ti­mizer will have to spend time learn­ing about the base ob­jec­tive to en­able it to prop­erly op­ti­mize for it, mak­ing in­ter­nal al­ign­ment the most time effi­cient. Thus, an ex­plicit or im­plicit time penalty would seem to dis­ad­van­tage a de­cep­tively al­igned mesa-op­ti­mizer and en­courage an in­ter­nally al­igned one. How­ever, it is un­clear what the mag­ni­tude of this ex­tra cost is. It is plau­si­ble that on com­plex tasks, the over­head of spot­ting defec­tion op­por­tu­ni­ties could be low in com­par­i­son to the cost of solv­ing the task it­self. More­over, even if it is not globally op­ti­mal, in­cur­ring this over­head might be a lo­cal op­ti­mum, with no clear paths for the base op­ti­mizer to es­cape it.

De­cep­tive and cor­rigible al­ign­ment might have a lower de­scrip­tion length. De­cep­tively and cor­rigibly al­igned mesa-op­ti­miz­ers are likely to have an ad­van­tage in rep­re­sent­ing the base ob­jec­tive com­pared to in­ter­nally al­igned mesa-op­ti­miz­ers as they are able to use the en­vi­ron­ment to provide some of the in­for­ma­tion. Speci­fi­cally, a de­cep­tively or cor­rigibly al­igned mesa-op­ti­mizer doesn’t need to rep­re­sent the en­tire base ob­jec­tive in­ter­nally, as it can in­fer pieces of it by look­ing for in­for­ma­tion in its en­vi­ron­ment dur­ing each run. An in­ter­nally al­igned mesa-op­ti­mizer, by con­trast, has to start with a full model of the base ob­jec­tive in its mesa-ob­jec­tive to make it ro­bustly al­igned, since an in­ter­nally al­igned mesa-op­ti­mizer will have no in­cen­tive to change its ob­jec­tive based on any ad­di­tional in­for­ma­tion it gets about the base ob­jec­tive. Thus, to the ex­tent that a base op­ti­mizer is se­lect­ing based on de­scrip­tion length, it might prefer­en­tially se­lect for a de­cep­tively or cor­rigibly al­igned mesa-op­ti­mizer over an in­ter­nally al­igned one.

Model­ing might hap­pen faster than in­ter­nal­iza­tion. It may take less time for the mesa-op­ti­mizer to learn to model the base ob­jec­tive than for the base op­ti­mizer to adapt the mesa-ob­jec­tive. Sup­pose that a mesa-op­ti­mizer is de­cep­tive but has not yet cor­rectly in­ferred the full base ob­jec­tive.[11] In such a case, what will de­ter­mine whether it ends up ro­bustly al­igned or de­cep­tively al­igned af­ter fur­ther train­ing will be whether it in­fers enough about the base ob­jec­tive be­fore the base op­ti­mizer mod­ifies its mesa-ob­jec­tive (ei­ther to make it in­ter­nally al­igned or cor­rigibly al­igned). Two fac­tors are likely to con­tribute to which hap­pens first: the rate at which the mesa-op­ti­mizer im­proves its es­ti­mate of the base ob­jec­tive, and the pa­ram­e­ter up­dates made by the base op­ti­mizer at each mis­take.

The first fac­tor, im­prov­ing its es­ti­mate of the base ob­jec­tive, seems to de­pend pri­mar­ily on the ca­pa­bil­ities of the mesa-op­ti­mizer dur­ing the timeframe when it is de­cep­tive, and on the com­plex­ity of the base ob­jec­tive that needs to be in­ferred. An im­por­tant spe­cial case is when the mesa-op­ti­mizer has ac­cess to cross-epi­sodic state. Such state could be used as mem­ory, al­low­ing the mesa-op­ti­mizer to build upon its prior think­ing in a pro­cess that hap­pens faster than the rate at which the pa­ram­e­ter up­dates al­ign its mesa-ob­jec­tive.

The sec­ond fac­tor, pa­ram­e­ter up­dates, de­pends on the na­ture of the base op­ti­mizer. Cur­rent gra­di­ent-de­scent-based base op­ti­miz­ers would have a small re­ac­tion to any given mis­take, but more so­phis­ti­cated base op­ti­miz­ers with ad­vanced safety fea­tures might be able to do bet­ter. For ex­am­ple, a more so­phis­ti­cated base op­ti­mizer might be able to scram­ble the mesa-op­ti­mizer’s pa­ram­e­ters or call for hu­man over­sight when­ever a sus­pi­cious failure is de­tected.

In gen­eral, pa­ram­e­ter up­dates are not guaran­teed to make a mesa-op­ti­mizer more al­igned. For in­stance, in­stead of al­ign­ing the mesa-ob­jec­tive, an up­date made af­ter a defec­tion could push the sys­tem to im­prove its plan­ning to the point that it no longer makes de­tectable mis­takes.

A con­se­quence of these con­sid­er­a­tions is that any pos­si­ble sys­tem­atic bias in how the base op­ti­mizer up­dates the mesa-op­ti­mizer’s ob­jec­tive or mod­el­ing ca­pa­bil­ities could play a role in de­ter­min­ing the fi­nal out­come.[12] This anal­y­sis re­mains spec­u­la­tive, how­ever—if we are to have con­fi­dence in sys­tems that are ca­pa­ble of pro­duc­ing mesa-op­ti­miz­ers, we still need a bet­ter un­der­stand­ing of the con­di­tions un­der which strongly op­ti­miz­ing a sys­tem re­sults in de­cep­tive al­ign­ment rather than ro­bust al­ign­ment.

4.5. Distri­bu­tional shift and de­cep­tive alignment

Once a mesa-op­ti­mizer is de­cep­tive, it will be­gin to mon­i­tor its in­put for op­por­tu­ni­ties to defect. Dur­ing train­ing, a de­cep­tively al­igned op­ti­mizer is strongly dis­cour­aged from defect­ing be­cause it knows that the base op­ti­mizer will se­lect against it. How­ever, once a de­cep­tively al­igned mesa-op­ti­mizer is free of the threat of ex­ter­nal mod­ifi­ca­tion, this no longer ap­plies. Thus, a de­cep­tively al­igned mesa-op­ti­mizer will be look­ing for ways to de­ter­mine whether or not it is in train­ing—that is, for the pres­ence of a dis­tri­bu­tional shift.

This means that the detri­men­tal effects of de­cep­tive al­ign­ment will only be ex­ter­nally visi­ble when the mesa-op­ti­mizer defects—ei­ther ran­domly or af­ter un­der­go­ing a dis­tri­bu­tional shift. How­ever, the un­der­ly­ing prob­lem is not the dis­tri­bu­tional shift it­self, but the failure to al­ign the mesa-op­ti­mizer in the first place. One can ei­ther try to solve the un­der­ly­ing prob­lem, by mak­ing the mesa-op­ti­mizer ro­bust to the dis­tri­bu­tional shift (i.e., ro­bustly al­igned), or one can try to patch the prob­lem by pre­vent­ing any dis­tri­bu­tional shift from oc­cur­ring.

In cur­rent AI sys­tems, a small amount of dis­tri­bu­tional shift be­tween train­ing and de­ploy­ment need not be prob­le­matic: so long as the differ­ence is small enough in the task-rele­vant ar­eas, the train­ing dis­tri­bu­tion does not need to perfectly re­flect the de­ploy­ment dis­tri­bu­tion. How­ever, this may not be the case for a de­cep­tively al­igned mesa-op­ti­mizer. If a de­cep­tively al­igned mesa-op­ti­mizer is suffi­ciently ad­vanced, it may de­tect very sub­tle dis­tri­bu­tional shifts for the pur­pose of in­fer­ring when the threat of mod­ifi­ca­tion has ceased.

This raises the ques­tion of whether it would even be pos­si­ble to con­tain a de­cep­tive mesa-op­ti­mizer by com­pletely re­mov­ing any de­tectable differ­ence be­tween train­ing and de­ploy­ment. There are a num­ber of rea­sons to be­lieve that this might not be fea­si­ble in prac­tice.

First, it would sig­nifi­cantly limit the use cases for the sys­tem. Ma­chine learn­ing is of­ten done with the in­tent of mak­ing sys­tems that can solve new prob­lems. If we re­quire that there is no dis­tri­bu­tional shift of any form, then we can­not al­low the sys­tem to en­counter prob­lems that it could not have en­coun­tered dur­ing train­ing, which means that we can­not give it truly new prob­lems. For ex­am­ple, train­ing the sys­tem in a simu­lated en­vi­ron­ment and de­ploy­ing it in the real world would likely in­volve enough de­tectable dis­tri­bu­tional shift to make it in­fea­si­ble un­der this ap­proach.

Se­cond, there are sub­tle forms of dis­tri­bu­tional shift that seem very difficult to elimi­nate. For ex­am­ple, if the mesa-op­ti­mizer is able to de­tect the pres­ence of pa­ram­e­ter up­dates—via any state car­ried over be­tween steps, for in­stance—then it would always ex­pe­rience a form of dis­tri­bu­tional shift at de­ploy­ment. Run­ning the sys­tem for a longer amount of time (or giv­ing the sys­tem ac­cess to some form of long-term mem­o­ries) would also con­sti­tute a form of dis­tri­bu­tional shift. Some ex­am­ples of differ­ences that a mesa-op­ti­mizer might be able to de­tect in­clude:

  • Train­ing in a simu­lated en­vi­ron­ment and de­ploy­ing in the real world.

  • Run­ning the sys­tem for a differ­ent amount of time dur­ing train­ing and test­ing.

  • If a model is able to learn from its past ex­pe­riences, then hav­ing ac­cess to more mem­o­ries, or hav­ing ac­cess to the kinds of mem­o­ries it would form at later time steps (if there is a differ­ence be­tween the two).

  • The pres­ence of ad­ver­sar­ial-ex­am­ple-type at­tacks by an ad­ver­sary dur­ing de­ploy­ment.

  • En­coun­ter­ing some­thing that is very hard to com­pute dur­ing train­ing, but that is likely to even­tu­ally be com­puted dur­ing de­ploy­ment (e.g. a fac­tor­iza­tion of RSA 2048(26)).

  • The abil­ity of the base op­ti­mizer to mod­ify the learned al­gorithm dur­ing train­ing but not at other times.

  • The abil­ity of the pro­gram­mers to cor­rect for per­ceived er­rors dur­ing train­ing and test­ing but not de­ploy­ment.

  • The pres­ence or ab­sence of good op­por­tu­ni­ties for the mesa-op­ti­mizer to defect against its pro­gram­mers.

Fur­ther­more, it should be noted that it is not strictly nec­es­sary that a mesa-op­ti­mizer be able to de­tect the differ­ence be­tween train­ing and de­ploy­ment in or­der for it to be in­cen­tivized to be­have de­cep­tively. Con­sider again the ex­am­ple in figure 4.1, but sup­pose this time that the mesa-op­ti­mizer is un­able to de­tect which epi­sode it is cur­rently in. The mesa-op­ti­mizer is then fac­ing a de­ci­sion prob­lem similar to the “ab­sent-minded driver prob­lem.”(27) If we ap­ply the same rea­son­ing as in the ab­sent-minded driver prob­lem to our ex­am­ple in figure 4.1, the plan­ning-op­ti­mal policy for the mesa-op­ti­mizer is to defect with prob­a­bil­ity . Thus, the mesa-op­ti­mizer need not be able to de­tect when train­ing has ended for it to be in­cen­tivized to “gam­ble” and ran­domly defect with some prob­a­bil­ity. Fur­ther­more, a de­cep­tively al­igned mesa-op­ti­mizer might also try to look for op­por­tu­ni­ties to defect that would be ex­cep­tion­ally good con­di­tional on it cur­rently not be­ing trained, but that are rare enough to be un­likely to come up dur­ing train­ing.

The fifth and fi­nal post in the Risks from Learned Op­ti­miza­tion Se­quence, ti­tled “Risks from Learned Op­ti­miza­tion: Con­clu­sion and Re­lated Work,” can be found here.

Glos­sary | Bibliography


  1. Though for clar­ity we speak of the same agent per­sist­ing across pa­ram­e­ter up­dates, nei­ther we nor the agent must rea­son in this way. A more gen­eral de­scrip­tion of the situ­a­tion is that there is a se­quence of “suc­ces­sor agents” con­nected by pa­ram­e­ter up­dates. The agent can po­ten­tially learn the effects of the pa­ram­e­ter up­dates on its suc­ces­sors, and thus on the multi-epi­sode re­turn it is op­ti­miz­ing. ↩︎

  2. The con­cept of de­cep­tive al­ign­ment is closely re­lated to Man­heim and Garrabrant’s con­cept of “ad­ver­sar­ial Good­hart.”(22) In “Cat­e­go­riz­ing Var­i­ants of Good­hart’s Law,” Man­heim and Garrabrant de­scribe ad­ver­sar­ial Good­hart as when “an agent with differ­ent goals than the reg­u­la­tor” causes a “col­lapse of the statis­ti­cal re­la­tion­ship be­tween a goal which the op­ti­mizer in­tends and the proxy used for that goal.” De­cep­tive al­ign­ment can be seen as an in­stance of ad­ver­sar­ial Good­hart in which the mesa-op­ti­mizer plays the role of the agent with differ­ent goals and the base op­ti­mizer plays the role of the reg­u­la­tor. Train­ing perfor­mance is the reg­u­la­tor’s proxy, and the de­ploy­ment perfor­mance is the reg­u­la­tor’s ac­tual goal. ↩︎

  3. Thanks to Paul Chris­ti­ano for the in­ter­nal­iza­tion/​mod­el­ing dis­tinc­tion. ↩︎

  4. Note, how­ever, that it is pos­si­ble for in­for­ma­tion about the base ob­jec­tive to be in­tro­duced via op­ti­miza­tion, but not form part of the mesa-ob­jec­tive. For ex­am­ple, it could form part of a prior on tasks, hav­ing no di­rect bear­ing on the mesa-ob­jec­tive. ↩︎

  5. Chris­ti­ano refers to de­cep­tively al­igned mesa-op­ti­miz­ers as “in­fluence-seek­ing sys­tems.” ↩︎

  6. Or must oth­er­wise have some method of achiev­ing more value from defect­ing later rather than defect­ing now. ↩︎

  7. Note that it is not re­quired that the mesa-op­ti­mizer be able to model (or in­fer the ex­is­tence of) the base op­ti­mizer; it only needs to model the op­ti­miza­tion pres­sure it is sub­ject to. ↩︎

  8. Alter­na­tively, the threat could just be that the mesa-op­ti­mizer is never se­lected by the base op­ti­mizer in the first place. In that case, this con­di­tion would be triv­ially satis­fied, as the threat to a mesa-op­ti­mizer of not be­ing gen­er­ated by the base op­ti­mizer at all would go away as soon as the base op­ti­mizer does in fact gen­er­ate it. ↩︎

  9. We will gen­er­ally be as­sum­ing that the se­lec­tion pres­sure takes the form of the base op­ti­mizer mod­ify­ing the mesa-op­ti­mizer post-hoc, as we are mostly think­ing about lo­cal op­ti­miza­tion pro­cesses, though one could also imag­ine a base op­ti­mizer that in­stead de­vel­ops a com­pletely new in­ner al­gorithm at each time step. Though some of these is­sues might still ap­ply even in such a case, we will gen­er­ally not be con­sid­er­ing such base op­ti­miz­ers in this sec­tion. ↩︎

  10. Note that if the mesa-op­ti­mizer also be­lieves that the op­por­tu­ni­ties it cur­rently has for achiev­ing its own ob­jec­tive are un­usu­ally good then it might still pur­sue its ob­jec­tive greed­ily, but this would pre­sum­ably not always be the case. ↩︎

  11. An analo­gous ar­gu­ment ap­plies for a de­cep­tively al­igned mesa-op­ti­mizer that is defect­ing too early, or one that un­suc­cess­fully at­tempts a treach­er­ous turn. ↩︎

  12. Fur­ther­more, a de­cep­tively al­igned mesa-op­ti­mizer would be in­cen­tivized to cause there to be a sys­tem­atic bias in the di­rec­tion of pre­vent­ing the base op­ti­mizer from mod­ify­ing its mesa-ob­jec­tive. Thus, in the con­text of a lo­cal op­ti­miza­tion pro­cess, a de­cep­tive mesa-op­ti­mizer might try to “hack” its own gra­di­ent (by, for ex­am­ple, mak­ing it­self more brit­tle in the case where its ob­jec­tive gets changed) to en­sure that the base op­ti­mizer ad­justs it in such a way that leaves its mesa-ob­jec­tive un­touched. ↩︎