The Inner Alignment Problem

This is the third of five posts in the Mesa-Op­ti­miza­tion Se­quence based on the MIRI pa­per “Risks from Learned Op­ti­miza­tion in Ad­vanced Ma­chine Learn­ing Sys­tems” by Evan Hub­inger, Chris van Mer­wijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the se­quence cor­re­sponds to a differ­ent sec­tion of the pa­per.

In this post, we out­line rea­sons to think that a mesa-op­ti­mizer may not op­ti­mize the same ob­jec­tive func­tion as its base op­ti­mizer. Ma­chine learn­ing prac­ti­tion­ers have di­rect con­trol over the base ob­jec­tive func­tion—ei­ther by spec­i­fy­ing the loss func­tion di­rectly or train­ing a model for it—but can­not di­rectly spec­ify the mesa-ob­jec­tive de­vel­oped by a mesa-op­ti­mizer. We re­fer to this prob­lem of al­ign­ing mesa-op­ti­miz­ers with the base ob­jec­tive as the in­ner al­ign­ment prob­lem. This is dis­tinct from the outer al­ign­ment prob­lem, which is the tra­di­tional prob­lem of en­sur­ing that the base ob­jec­tive cap­tures the in­tended goal of the pro­gram­mers.

Current machine learning methods select learned algorithms by empirically evaluating their performance on a set of training data according to the base objective function. Thus, ML base optimizers select mesa-optimizers according to the output they produce rather than directly selecting for a particular mesa-objective. Moreover, the selected mesa-optimizer’s policy only has to perform well (as scored by the base objective) on the training data. If we adopt the assumption that the mesa-optimizer computes an optimal policy given its objective function, then we can summarize the relationship between the base and mesa-objectives as follows:(17)

$$\theta^* = \operatorname{argmax}_\theta \, \mathbb{E}\left(O_{\text{base}}(\pi_\theta)\right), \quad \text{where} \quad \pi_\theta = \operatorname{argmax}_\pi \, \mathbb{E}\left(O_{\text{mesa}}(\pi \mid \theta)\right)$$

That is, the base optimizer maximizes its objective $O_{\text{base}}$ by choosing a mesa-optimizer with parameterization $\theta$ based on the mesa-optimizer’s policy $\pi_\theta$, but not based on the objective function $O_{\text{mesa}}$ that the mesa-optimizer uses to compute this policy. Depending on the base optimizer, we will think of $O_{\text{base}}$ as the negative of the loss, the future discounted reward, or simply some fitness function by which learned algorithms are being selected.
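To make the selection argument concrete, here is a minimal Python sketch under invented assumptions: the two-action cleaning world, the objective functions, and every name in it (`o_base`, `optimal_policy`, `base_score`, `MESA_OBJECTIVES`) are hypothetical illustrations, not anything specified in the paper. It shows a base optimizer scoring candidates only through the behavior of their policies on training data, so that two different mesa-objectives receive the same score.

```python
from typing import Callable, Dict, List

State = str
Action = str
Objective = Callable[[State, Action], float]
Policy = Callable[[State], Action]

ACTIONS: List[Action] = ["sweep", "idle"]
# Assumption: every floor seen in training is dusty.
TRAIN_STATES: List[State] = ["dusty", "dusty", "dusty"]


def o_base(state: State, action: Action) -> float:
    """Hypothetical base objective: 1 if a dusty floor gets swept, else 0."""
    return 1.0 if state == "dusty" and action == "sweep" else 0.0


def optimal_policy(o_mesa: Objective) -> Policy:
    """The mesa-optimizer's policy: act optimally for its own mesa-objective."""
    return lambda state: max(ACTIONS, key=lambda a: o_mesa(state, a))


def base_score(policy: Policy) -> float:
    """The base optimizer only sees how the policy behaves on training data."""
    return sum(o_base(s, policy(s)) for s in TRAIN_STATES) / len(TRAIN_STATES)


# Three hypothetical mesa-objectives; "sweep_lover" ignores the state entirely.
MESA_OBJECTIVES: Dict[str, Objective] = {
    "robust": o_base,
    "sweep_lover": lambda s, a: 1.0 if a == "sweep" else 0.0,
    "idle_lover": lambda s, a: 1.0 if a == "idle" else 0.0,
}

# base_score never receives the mesa-objective itself, only the induced policy.
scores = {
    name: base_score(optimal_policy(obj)) for name, obj in MESA_OBJECTIVES.items()
}
print(scores)
```

In this toy setup the "robust" and "sweep_lover" candidates tie on the training data, so a selection rule based on `base_score` alone has no way to prefer one over the other.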

An in­ter­est­ing ap­proach to an­a­lyz­ing this con­nec­tion is pre­sented in Ibarz et al, where em­piri­cal sam­ples of the true re­ward and a learned re­ward on the same tra­jec­to­ries are used to cre­ate a scat­ter-plot vi­su­al­iza­tion of the al­ign­ment be­tween the two.(18) The as­sump­tion in that work is that a mono­tonic re­la­tion­ship be­tween the learned re­ward and true re­ward in­di­cates al­ign­ment, whereas de­vi­a­tions from that sug­gest mis­al­ign­ment. Build­ing on this sort of re­search, bet­ter the­o­ret­i­cal mea­sures of al­ign­ment might some­day al­low us to speak con­cretely in terms of prov­able guaran­tees about the ex­tent to which a mesa-op­ti­mizer is al­igned with the base op­ti­mizer that cre­ated it.
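One way to put a number on the monotonicity assumption described above is a rank correlation between the two reward signals. The sketch below is an illustrative substitute, not the method used by Ibarz et al., who present a scatter plot; the function names and the sample reward values are made up for the example, and the code assumes neither input is constant.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-trajectory rewards (invented numbers).
true_reward = [1.0, 2.0, 3.0, 4.0, 5.0]
learned_monotone = [0.1, 0.4, 0.5, 2.0, 9.0]   # different scale, same ordering
learned_reversed = [5.0, 4.0, 3.0, 2.0, 1.0]   # ordering fully inverted

print(spearman(true_reward, learned_monotone))
print(spearman(true_reward, learned_reversed))
```

A value near 1 corresponds to the monotonic relationship that the cited work treats as evidence of alignment, while lower values flag trajectories on which the two rewards rank outcomes differently. Like the scatter plot, this only covers the trajectories that were actually sampled.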

3.1. Pseudo-alignment

There is cur­rently no com­plete the­ory of the fac­tors that af­fect whether a mesa-op­ti­mizer will be pseudo-al­igned—that is, whether it will ap­pear al­igned on the train­ing data, while ac­tu­ally op­ti­miz­ing for some­thing other than the base ob­jec­tive. Nev­er­the­less, we out­line a ba­sic clas­sifi­ca­tion of ways in which a mesa-op­ti­mizer could be pseudo-al­igned:

  1. Proxy al­ign­ment,

  2. Ap­prox­i­mate al­ign­ment, and

  3. Subop­ti­mal­ity al­ign­ment.

Proxy al­ign­ment. The ba­sic idea of proxy al­ign­ment is that a mesa-op­ti­mizer can learn to op­ti­mize for some proxy of the base ob­jec­tive in­stead of the base ob­jec­tive it­self. We’ll start by con­sid­er­ing two spe­cial cases of proxy al­ign­ment: side-effect al­ign­ment and in­stru­men­tal al­ign­ment.

First, a mesa-optimizer is side-effect aligned if optimizing for the mesa-objective $O_{\text{mesa}}$ has the direct causal result of increasing the base objective $O_{\text{base}}$ in the training distribution, and thus when the mesa-optimizer optimizes $O_{\text{mesa}}$ it results in an increase in $O_{\text{base}}$. For an example of side-effect alignment, suppose that we are training a cleaning robot. Consider a robot that optimizes the number of times it has swept a dusty floor. Sweeping a floor causes the floor to be cleaned, so this robot would be given a good score by the base optimizer. However, if during deployment it is offered a way to make the floor dusty again after cleaning it (e.g. by scattering the dust it swept up back onto the floor), the robot will take it, as it can then continue sweeping dusty floors.

Second, a mesa-optimizer is instrumentally aligned if optimizing for the base objective $O_{\text{base}}$ has the direct causal result of increasing the mesa-objective $O_{\text{mesa}}$ in the training distribution, and thus the mesa-optimizer optimizes $O_{\text{base}}$ as an instrumental goal for the purpose of increasing $O_{\text{mesa}}$. For an example of instrumental alignment, suppose again that we are training a cleaning robot. Consider a robot that optimizes the amount of dust in the vacuum cleaner. Suppose that in the training distribution the easiest way to get dust into the vacuum cleaner is to vacuum the dust on the floor. It would then do a good job of cleaning in the training distribution and would be given a good score by the base optimizer. However, if during deployment the robot came across a more effective way to acquire dust (such as vacuuming the soil in a potted plant), then it would no longer exhibit the desired behavior.

We pro­pose that it is pos­si­ble to un­der­stand the gen­eral in­ter­ac­tion be­tween side-effect and in­stru­men­tal al­ign­ment us­ing causal graphs, which leads to our gen­eral no­tion of proxy al­ign­ment.

Suppose we model a task as a causal graph with nodes for all possible attributes of that task and arrows between nodes for all possible relationships between those attributes. Then we can also think of the mesa-objective $O_{\text{mesa}}$ and the base objective $O_{\text{base}}$ as nodes in this graph. For $O_{\text{mesa}}$ to be pseudo-aligned, there must exist some node $X$ such that $X$ is an ancestor of both $O_{\text{mesa}}$ and $O_{\text{base}}$ in the training distribution, and such that $O_{\text{mesa}}$ and $O_{\text{base}}$ increase with $X$. If $X = O_{\text{mesa}}$, this is side-effect alignment, and if $X = O_{\text{base}}$, this is instrumental alignment.

This represents the most generalized form of a relationship between $O_{\text{mesa}}$ and $O_{\text{base}}$ that can contribute to pseudo-alignment. Specifically, consider the causal graph given in figure 3.1. A mesa-optimizer with mesa-objective $O_{\text{mesa}}$ will decide to optimize $X$ as an instrumental goal of optimizing $O_{\text{mesa}}$, since $X$ increases $O_{\text{mesa}}$. This will then result in $O_{\text{base}}$ increasing, since optimizing for $X$ has the side-effect of increasing $O_{\text{base}}$. Thus, in the general case, side-effect and instrumental alignment can work together to contribute to pseudo-alignment over the training distribution, which is the general case of proxy alignment.

Figure 3.1. A causal di­a­gram of the train­ing en­vi­ron­ment for the differ­ent types of proxy al­ign­ment. The di­a­grams rep­re­sent, from top to bot­tom, side-effect al­ign­ment (top), in­stru­men­tal al­ign­ment (mid­dle), and gen­eral proxy al­ign­ment (bot­tom). The ar­rows rep­re­sent pos­i­tive causal re­la­tion­ships—that is, cases where an in­crease in the par­ent causes an in­crease in the child.
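The causal-graph condition can be sketched as a small reachability check. The following is a minimal illustration, assuming every arrow denotes a positive causal relationship (as in the figure caption) and treating a node as an ancestor of itself, which is how the side-effect and instrumental cases fall out as special cases in the text. The graph encoding and the names `descendants` and `classify` are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # parent -> children; arrows assumed positive


def descendants(graph: Graph, node: str) -> Set[str]:
    """All nodes reachable from `node` along arrows, including `node` itself."""
    seen = {node}
    stack = [node]
    while stack:
        current = stack.pop()
        for child in graph.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def classify(graph: Graph, o_mesa: str = "O_mesa", o_base: str = "O_base") -> str:
    """Name the kind of proxy alignment the training-time graph supports."""
    if o_base in descendants(graph, o_mesa):
        return "side-effect"      # X = O_mesa: optimizing O_mesa raises O_base
    if o_mesa in descendants(graph, o_base):
        return "instrumental"     # X = O_base: optimizing O_base raises O_mesa
    nodes = set(graph) | {c for children in graph.values() for c in children}
    for x in sorted(nodes):
        reach = descendants(graph, x)
        if o_mesa in reach and o_base in reach:
            return "general proxy"  # some other node X drives both objectives
    return "none"


side_effect_graph: Graph = {"O_mesa": ["O_base"]}
instrumental_graph: Graph = {"O_base": ["O_mesa"]}
general_graph: Graph = {"X": ["O_mesa", "O_base"]}

for g in (side_effect_graph, instrumental_graph, general_graph):
    print(classify(g))
```

The check only describes the training distribution. The failure modes in the cleaning-robot examples arise when deployment adds or removes arrows, which this sketch does not model.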

Ap­prox­i­mate al­ign­ment. A mesa-op­ti­mizer is ap­prox­i­mately al­igned if the mesa-ob­jec­tive and the base ob­jec­tive are ap­prox­i­mately the same func­tion up to some de­gree of ap­prox­i­ma­tion er­ror re­lated to the fact that the mesa-ob­jec­tive has to be rep­re­sented in­side the mesa-op­ti­mizer rather than be­ing di­rectly pro­grammed by hu­mans. For ex­am­ple, sup­pose you task a neu­ral net­work with op­ti­miz­ing for some base ob­jec­tive that is im­pos­si­ble to perfectly rep­re­sent in the neu­ral net­work it­self. Even if you get a mesa-op­ti­mizer that is as al­igned as pos­si­ble, it still will not be perfectly ro­bustly al­igned in this sce­nario, since there will have to be some de­gree of ap­prox­i­ma­tion er­ror be­tween its in­ter­nal rep­re­sen­ta­tion of the base ob­jec­tive and the ac­tual base ob­jec­tive.

Subop­ti­mal­ity al­ign­ment. A mesa-op­ti­mizer is sub­op­ti­mal­ity al­igned if some defi­ciency, er­ror, or limi­ta­tion in its op­ti­miza­tion pro­cess causes it to ex­hibit al­igned be­hav­ior on the train­ing dis­tri­bu­tion. This could be due to com­pu­ta­tional con­straints, un­sound rea­son­ing, a lack of in­for­ma­tion, ir­ra­tional de­ci­sion pro­ce­dures, or any other defect in the mesa-op­ti­mizer’s rea­son­ing pro­cess. Im­por­tantly, we are not refer­ring to a situ­a­tion where the mesa-op­ti­mizer is ro­bustly al­igned but nonethe­less makes mis­takes lead­ing to bad out­comes on the base ob­jec­tive. Rather, sub­op­ti­mal­ity al­ign­ment refers to the situ­a­tion where the mesa-op­ti­mizer is mis­al­igned but nev­er­the­less performs well on the base ob­jec­tive, pre­cisely be­cause it has been se­lected to make mis­takes that lead to good out­comes on the base ob­jec­tive.

For an example of suboptimality alignment, consider a cleaning robot with a mesa-objective of minimizing the total amount of stuff in existence. If this robot has the mistaken belief that the dirt it cleans is completely destroyed, then it may be useful for cleaning the room despite doing so not actually helping it succeed at its objective. This robot will be observed to be a good optimizer of $O_{\text{base}}$ and hence be given a good score by the base optimizer. However, if during deployment the robot is able to improve its world model, it will stop exhibiting the desired behavior.

As another, perhaps more realistic example of suboptimality alignment, consider a mesa-optimizer with a mesa-objective $O_{\text{mesa}}$ and an environment in which there is one simple strategy and one complicated strategy for achieving $O_{\text{mesa}}$. It could be that the simple strategy is aligned with the base optimizer, but the complicated strategy is not. The mesa-optimizer might then initially only be aware of the simple strategy, and thus be suboptimality aligned, until it has been run for long enough to come up with the complicated strategy, at which point it stops exhibiting the desired behavior.

3.2. The task

As in the sec­ond post, we will now con­sider the task the ma­chine learn­ing sys­tem is trained on. Speci­fi­cally, we will ad­dress how the task af­fects a ma­chine learn­ing sys­tem’s propen­sity to pro­duce pseudo-al­igned mesa-op­ti­miz­ers.

Uniden­ti­fi­a­bil­ity. It is a com­mon prob­lem in ma­chine learn­ing for a dataset to not con­tain enough in­for­ma­tion to ad­e­quately pin­point a spe­cific con­cept. This is closely analo­gous to the rea­son that ma­chine learn­ing mod­els can fail to gen­er­al­ize or be sus­cep­ti­ble to ad­ver­sar­ial ex­am­ples(19)—there are many more ways of clas­sify­ing data that do well in train­ing than any spe­cific way the pro­gram­mers had in mind. In the con­text of mesa-op­ti­miza­tion, this man­i­fests as pseudo-al­ign­ment be­ing more likely to oc­cur when a train­ing en­vi­ron­ment does not con­tain enough in­for­ma­tion to dis­t­in­guish be­tween a wide va­ri­ety of differ­ent ob­jec­tive func­tions. In such a case there will be many more ways for a mesa-op­ti­mizer to be pseudo-al­igned than ro­bustly al­igned—one for each in­dis­t­in­guish­able ob­jec­tive func­tion. Thus, most mesa-op­ti­miz­ers that do well on the base ob­jec­tive will be pseudo-al­igned rather than ro­bustly al­igned. This is a crit­i­cal con­cern be­cause it makes ev­ery other prob­lem of pseudo-al­ign­ment worse—it is a rea­son that, in gen­eral, it is hard to find ro­bustly al­igned mesa-op­ti­miz­ers. Uniden­ti­fi­a­bil­ity in mesa-op­ti­miza­tion is par­tially analo­gous to the prob­lem of uniden­ti­fi­a­bil­ity in re­ward learn­ing, in that the cen­tral is­sue is iden­ti­fy­ing the “cor­rect” ob­jec­tive func­tion given par­tic­u­lar train­ing data.(20) We will dis­cuss this re­la­tion­ship fur­ther in the fifth post.

In the con­text of mesa-op­ti­miza­tion, there is also an ad­di­tional source of uniden­ti­fi­a­bil­ity stem­ming from the fact that the mesa-op­ti­mizer is se­lected merely on the ba­sis of its out­put. Con­sider the fol­low­ing toy re­in­force­ment learn­ing ex­am­ple. Sup­pose that in the train­ing en­vi­ron­ment, press­ing a but­ton always causes a lamp to turn on with a ten-sec­ond de­lay, and that there is no other way to turn on the lamp. If the base ob­jec­tive de­pends only on whether the lamp is turned on, then a mesa-op­ti­mizer that max­i­mizes but­ton presses and one that max­i­mizes lamp light will show iden­ti­cal be­hav­ior, as they will both press the but­ton as of­ten as they can. Thus, we can­not dis­t­in­guish these two ob­jec­tive func­tions in this train­ing en­vi­ron­ment. Nev­er­the­less, the train­ing en­vi­ron­ment does con­tain enough in­for­ma­tion to dis­t­in­guish at least be­tween these two par­tic­u­lar ob­jec­tives: since the high re­ward only comes af­ter the ten-sec­ond de­lay, it must be from the lamp, not the but­ton. As such, even if a train­ing en­vi­ron­ment in prin­ci­ple con­tains enough in­for­ma­tion to iden­tify the base ob­jec­tive, it might still be im­pos­si­ble to dis­t­in­guish ro­bustly al­igned from proxy-al­igned mesa-op­ti­miz­ers.
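The button-and-lamp example can be sketched in a few lines of Python. The training environment follows the description above, except that the ten-second delay is omitted for brevity. The deployment environment, in which the button is disconnected and a separate switch controls the lamp, is an invented assumption added to show how the two objectives come apart; all names are hypothetical.

```python
from typing import Dict

# Each action maps to its outcome: how many button presses and whether the lamp lights.
Env = Dict[str, Dict[str, int]]

# Training: pressing the button is the only way to turn on the lamp.
TRAIN_ENV: Env = {
    "press": {"button_presses": 1, "lamp_on": 1},
    "wait": {"button_presses": 0, "lamp_on": 0},
}

# Hypothetical deployment shift (not in the source): the button is disconnected
# and a wall switch now controls the lamp.
DEPLOY_ENV: Env = {
    "press": {"button_presses": 1, "lamp_on": 0},
    "flip_switch": {"button_presses": 0, "lamp_on": 1},
    "wait": {"button_presses": 0, "lamp_on": 0},
}


def act(env: Env, mesa_objective: str) -> str:
    """A mesa-optimizer picks the action that maximizes its own objective."""
    return max(env, key=lambda action: env[action][mesa_objective])


for name, env in (("train", TRAIN_ENV), ("deploy", DEPLOY_ENV)):
    print(name, act(env, "button_presses"), act(env, "lamp_on"))
```

Both mesa-optimizers choose `press` in training, so selection on output alone cannot separate them. Only under the assumed shift does the lamp maximizer switch to `flip_switch` while the button maximizer keeps pressing.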

Proxy choice as pre-com­pu­ta­tion. Proxy al­ign­ment can be seen as a form of pre-com­pu­ta­tion by the base op­ti­mizer. Proxy al­ign­ment al­lows the base op­ti­mizer to save the mesa-op­ti­mizer com­pu­ta­tional work by pre-com­put­ing which prox­ies are valuable for the base ob­jec­tive and then let­ting the mesa-op­ti­mizer max­i­mize those prox­ies.

Without such pre-com­pu­ta­tion, the mesa-op­ti­mizer has to in­fer at run­time the causal re­la­tion­ship be­tween differ­ent in­put fea­tures and the base ob­jec­tive, which might re­quire sig­nifi­cant com­pu­ta­tional work. More­over, er­rors in this in­fer­ence could re­sult in out­puts that perform worse on the base ob­jec­tive than if the sys­tem had ac­cess to pre-com­puted prox­ies. If the base op­ti­mizer pre­com­putes some of these causal re­la­tion­ships—by se­lect­ing the mesa-ob­jec­tive to in­clude good prox­ies—more com­pu­ta­tion at run­time can be di­verted to mak­ing bet­ter plans in­stead of in­fer­ring these re­la­tion­ships.

The case of biolog­i­cal evolu­tion may illus­trate this point. The prox­ies that hu­mans care about—food, re­sources, com­mu­nity, mat­ing, etc.—are rel­a­tively com­pu­ta­tion­ally easy to op­ti­mize di­rectly, while cor­re­lat­ing well with sur­vival and re­pro­duc­tion in our an­ces­tral en­vi­ron­ment. For a hu­man to be ro­bustly al­igned with evolu­tion would have re­quired us to in­stead care di­rectly about spread­ing our genes, in which case we would have to in­fer that eat­ing, co­op­er­at­ing with oth­ers, pre­vent­ing phys­i­cal pain, etc. would pro­mote ge­netic fit­ness in the long run, which is not a triv­ial task. To in­fer all of those prox­ies from the in­for­ma­tion available to early hu­mans would have re­quired greater (per­haps un­fea­si­bly greater) com­pu­ta­tional re­sources than to sim­ply op­ti­mize for them di­rectly. As an ex­treme illus­tra­tion, for a child in this al­ter­nate uni­verse to figure out not to stub its toe, it would have to re­al­ize that do­ing so would slightly diminish its chances of re­pro­duc­ing twenty years later.

For pre-com­pu­ta­tion to be benefi­cial, there needs to be a rel­a­tively sta­ble causal re­la­tion­ship be­tween a proxy vari­able and the base ob­jec­tive such that op­ti­miz­ing for the proxy will con­sis­tently do well on the base ob­jec­tive. How­ever, even an im­perfect re­la­tion­ship might give a sig­nifi­cant perfor­mance boost over ro­bust al­ign­ment if it frees up the mesa-op­ti­mizer to put sig­nifi­cantly more com­pu­ta­tional effort into op­ti­miz­ing its out­put. This anal­y­sis sug­gests that there might be pres­sure to­wards proxy al­ign­ment in com­plex train­ing en­vi­ron­ments, since the more com­plex the en­vi­ron­ment, the more com­pu­ta­tional work pre-com­pu­ta­tion saves the mesa-op­ti­mizer. Ad­di­tion­ally, the more com­plex the en­vi­ron­ment, the more po­ten­tial proxy vari­ables are available for the mesa-op­ti­mizer to use.

Fur­ther­more, in the con­text of ma­chine learn­ing, this anal­y­sis sug­gests that a time com­plex­ity penalty (as op­posed to a de­scrip­tion length penalty) is a dou­ble-edged sword. In the sec­ond post, we sug­gested that pe­nal­iz­ing time com­plex­ity might serve to re­duce the like­li­hood of mesa-op­ti­miza­tion. How­ever, the above sug­gests that do­ing so would also pro­mote pseudo-al­ign­ment in those cases where mesa-op­ti­miz­ers do arise. If the cost of fully mod­el­ing the base ob­jec­tive in the mesa-op­ti­mizer is large, then a pseudo-al­igned mesa-op­ti­mizer might be preferred sim­ply be­cause it re­duces time com­plex­ity, even if it would un­der­perform a ro­bustly al­igned mesa-op­ti­mizer with­out such a penalty.
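A toy calculation illustrates the trade-off. All numbers below are invented for the example, and the scoring rule (performance minus a linear time penalty) is just one simple assumption about how a time complexity penalty could enter selection.

```python
from typing import Dict

# Hypothetical candidates: the robustly aligned one performs slightly better
# but spends more runtime modeling the base objective.
CANDIDATES: Dict[str, Dict[str, float]] = {
    "robustly_aligned": {"performance": 1.00, "time_cost": 10.0},
    "proxy_aligned": {"performance": 0.90, "time_cost": 2.0},
}


def selected(candidates: Dict[str, Dict[str, float]], time_penalty: float) -> str:
    """Return the candidate maximizing performance minus a linear time penalty."""
    def score(name: str) -> float:
        c = candidates[name]
        return c["performance"] - time_penalty * c["time_cost"]
    return max(candidates, key=score)


print(selected(CANDIDATES, time_penalty=0.0))
print(selected(CANDIDATES, time_penalty=0.05))
```

With no penalty the robustly aligned candidate wins on performance alone. With the assumed penalty of 0.05 per unit of time its score drops to 0.5 against 0.8 for the proxy-aligned candidate, so the cheaper proxy is selected even though it underperforms without the penalty.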

Com­pres­sion of the mesa-op­ti­mizer. The de­scrip­tion length of a ro­bustly al­igned mesa-op­ti­mizer may be greater than that of a pseudo-al­igned mesa-op­ti­mizer. Since there are more pseudo-al­igned mesa-ob­jec­tives than ro­bustly al­igned mesa-ob­jec­tives, pseudo-al­ign­ment pro­vides more de­grees of free­dom for choos­ing a par­tic­u­larly sim­ple mesa-ob­jec­tive. Thus, we ex­pect that in most cases there will be sev­eral pseudo-al­igned mesa-op­ti­miz­ers that are less com­plex than any ro­bustly al­igned mesa-op­ti­mizer.

This de­scrip­tion cost is es­pe­cially high if the learned al­gorithm’s in­put data does not con­tain easy-to-in­fer in­for­ma­tion about how to op­ti­mize for the base ob­jec­tive. Biolog­i­cal evolu­tion seems to differ from ma­chine learn­ing in this sense, since evolu­tion’s speci­fi­ca­tion of the brain has to go through the in­for­ma­tion fun­nel of DNA. The sen­sory data that early hu­mans re­ceived didn’t al­low them to in­fer the ex­is­tence of DNA, nor the re­la­tion­ship be­tween their ac­tions and their ge­netic fit­ness. There­fore, for hu­mans to have been al­igned with evolu­tion would have re­quired them to have an in­nately speci­fied model of DNA, as well as the var­i­ous fac­tors in­fluenc­ing their in­clu­sive ge­netic fit­ness. Such a model would not have been able to make use of en­vi­ron­men­tal in­for­ma­tion for com­pres­sion, and thus would have re­quired a greater de­scrip­tion length. In con­trast, our mod­els of food, pain, etc. can be very short since they are di­rectly re­lated to our in­put data.

3.3. The base optimizer

We now turn to how the base op­ti­mizer is likely to af­fect the propen­sity for a ma­chine learn­ing sys­tem to pro­duce pseudo-al­igned mesa-op­ti­miz­ers.

Hard-coded op­ti­miza­tion. In the sec­ond post, we sug­gested that hard-cod­ing an op­ti­miza­tion al­gorithm—that is to say, choos­ing a model with built-in op­ti­miza­tion—could be used to re­move some of the in­cen­tives for mesa-op­ti­miza­tion. Similarly, hard-coded op­ti­miza­tion may be used to pre­vent some of the sources of pseudo-al­ign­ment, since it may al­low one to di­rectly spec­ify or train the mesa-ob­jec­tive. Re­ward-pre­dic­tive model-based re­in­force­ment learn­ing might be one pos­si­ble way of ac­com­plish­ing this.(21) For ex­am­ple, an ML sys­tem could in­clude a model di­rectly trained to pre­dict the base ob­jec­tive to­gether with a pow­er­ful hard-coded op­ti­miza­tion al­gorithm. Do­ing this by­passes some of the prob­lems of pseudo-al­ign­ment: if the mesa-op­ti­mizer is trained to di­rectly pre­dict the base re­ward, then it will be se­lected to make good pre­dic­tions even if a bad pre­dic­tion would re­sult in a good policy. How­ever, a learned model of the base ob­jec­tive will still be un­der­de­ter­mined off-dis­tri­bu­tion, so this ap­proach by it­self does not guaran­tee ro­bust al­ign­ment.

Al­gorith­mic range. We hy­poth­e­size that a model’s al­gorith­mic range will have im­pli­ca­tions for how likely it is to de­velop pseudo-al­ign­ment. One pos­si­ble source of pseudo-al­ign­ment that could be par­tic­u­larly difficult to avoid is ap­prox­i­ma­tion er­ror—if a mesa-op­ti­mizer is not ca­pa­ble of faith­fully rep­re­sent­ing the base ob­jec­tive, then it can’t pos­si­bly be ro­bustly al­igned, only ap­prox­i­mately al­igned. Even if a mesa-op­ti­mizer might the­o­ret­i­cally be able to perfectly cap­ture the base ob­jec­tive, the more difficult that is for it to do, the more we might ex­pect it to be ap­prox­i­mately al­igned rather than ro­bustly al­igned. Thus, a large al­gorith­mic range may be both a bless­ing and a curse: it makes it less likely that mesa-op­ti­miz­ers will be ap­prox­i­mately al­igned, but it also in­creases the like­li­hood of get­ting a mesa-op­ti­mizer in the first place.[1]

Subpro­cess in­ter­de­pen­dence. There are some rea­sons to be­lieve that there might be more ini­tial op­ti­miza­tion pres­sure to­wards proxy al­igned than ro­bustly al­igned mesa-op­ti­miz­ers. In a lo­cal op­ti­miza­tion pro­cess, each pa­ram­e­ter of the learned al­gorithm (e.g. the pa­ram­e­ter vec­tor of a neu­ron) is ad­justed to lo­cally im­prove the base ob­jec­tive con­di­tional on the other pa­ram­e­ters. Thus, the benefit for the base op­ti­mizer of de­vel­op­ing a new sub­pro­cess will likely de­pend on what other sub­pro­cesses the learned al­gorithm cur­rently im­ple­ments. There­fore, even if some sub­pro­cess would be very benefi­cial if com­bined with many other sub­pro­cesses, the base op­ti­mizer may not se­lect for it un­til the sub­pro­cesses it de­pends on are suffi­ciently de­vel­oped. As a re­sult, a lo­cal op­ti­miza­tion pro­cess would likely re­sult in sub­pro­cesses that have fewer de­pen­den­cies be­ing de­vel­oped be­fore those with more de­pen­den­cies.

In the con­text of mesa-op­ti­miza­tion, the benefit of a ro­bustly al­igned mesa-ob­jec­tive seems to de­pend on more sub­pro­cesses than at least some pseudo-al­igned mesa-ob­jec­tives. For ex­am­ple, con­sider a side-effect al­igned mesa-op­ti­mizer op­ti­miz­ing for some set of proxy vari­ables. Sup­pose that it needs to run some sub­pro­cess to model the re­la­tion­ship be­tween its ac­tions and those proxy vari­ables. If we as­sume that op­ti­miz­ing the proxy vari­ables is nec­es­sary to perform well on the base ob­jec­tive, then for a mesa-op­ti­mizer to be ro­bustly al­igned, it would also need to model the causal re­la­tion­ship be­tween those proxy vari­ables and the base ob­jec­tive, which might re­quire ad­di­tional sub­pro­cesses. More­over, the benefit to the base op­ti­mizer of adding those sub­pro­cesses de­pends on the mesa-op­ti­mizer hav­ing ad­di­tional sub­pro­cesses to model the re­la­tion­ship be­tween its ac­tions and those proxy vari­ables. This in­for­mal ar­gu­ment sug­gests that if a mesa-op­ti­mizer’s com­pu­ta­tion neatly fac­tors in this way, then de­vel­op­ing a ro­bustly al­igned mesa-ob­jec­tive may re­quire strictly more sub­pro­cesses than de­vel­op­ing a pseudo-al­igned mesa-ob­jec­tive.

This sug­gests that, at least in a lo­cal op­ti­miza­tion pro­cess, mesa-op­ti­miz­ers might tend to start their de­vel­op­ment as proxy al­igned be­fore be­com­ing ro­bustly al­igned. In other words, rather than si­mul­ta­neously gain­ing com­pe­tence and be­com­ing al­igned, we might ex­pect such a sys­tem to first be­come com­pe­tent at op­ti­miz­ing prox­ies, then pos­si­bly start be­com­ing more ro­bustly al­igned.

In­ter­nals-based se­lec­tion. In cur­rent ma­chine learn­ing sys­tems, learned al­gorithms are se­lected pri­mar­ily based on how their poli­cies perform on the base ob­jec­tive. There is of­ten some se­lec­tion based on the in­ter­nal struc­ture of the learned al­gorithm, such as weight reg­u­lariza­tion, but such se­lec­tion is usu­ally very coarse. One pos­si­ble idea for ad­dress­ing pseudo-al­ign­ment could be to use a se­lec­tion method based on a more so­phis­ti­cated func­tion of a learned al­gorithm’s in­ter­nal work­ings. This ap­proach is similar in spirit to hard-coded op­ti­miza­tion: we use knowl­edge of the sys­tem’s in­ter­nal work­ings to pre­vent it from be­ing proxy-al­igned.

One pos­si­bil­ity might be to de­velop a meta-learner that is speci­fi­cally trained to pre­vent pseudo-al­ign­ment. While this seems be­yond cur­rent meta-learn­ing tech­niques, the ba­sic prin­ci­ple is similar to ex­ist­ing meta-learn­ers de­signed to op­ti­mize for ro­bust­ness. An ex­treme form of in­ter­nals-based se­lec­tion might have the base op­ti­mizer rea­son about or prove prop­er­ties of the learned al­gorithm as part of its se­lec­tion pro­cess.

While de­sir­able, such meth­ods are cur­rently out of reach pri­mar­ily due to the limi­ta­tions of cur­rent trans­parency meth­ods. One pos­si­ble con­cern with all of these ap­proaches, how­ever, is that they might re­quire the base op­ti­mizer to be bet­ter at search­ing for prob­lems than the mesa-op­ti­mizer is at hid­ing them. Fur­ther­more, in or­der to cre­ate a base op­ti­mizer that can do this, we may need to first train it us­ing a third op­ti­mizer, which might re­sult in the same prob­lem at this sec­ond layer.

Adversarial training. One possible method to alleviate some of these concerns might be to sample environments adversarially from a very wide distribution. With such adversarial training, it might be possible to overcome these hurdles and make the robustly aligned objective the most reachable one. This would require making it so that, for every mesa-objective $O_{\text{mesa}}$ that is more reachable than the base objective $O_{\text{base}}$, there is some situation where $O_{\text{mesa}}$ and $O_{\text{base}}$ disagree and the mesa-optimizer is incentivized to agree with $O_{\text{base}}$.

If we model reachability of an objective as simply its length in bits, then distinguishing $O_{\text{base}}$ from every single more reachable $O_{\text{mesa}}$ gets exponentially harder as $O_{\text{base}}$ gets more complex. Thus, for a very complicated $O_{\text{base}}$, sufficiently incentivizing the base optimizer to find a mesa-optimizer with that $O_{\text{base}}$ is likely to be very difficult, though not impossible.
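The exponential growth follows from a simple count. Under the length-in-bits model, the candidates more reachable than an n-bit base objective are at most the bit strings shorter than n bits, of which there are 2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1 (counting the empty string). The sketch below computes this upper bound; it is an upper bound only, since the assumption that every bit string encodes a distinct, valid objective is a simplification, and the function name is hypothetical.

```python
def shorter_descriptions(n_bits: int) -> int:
    """Count bit strings strictly shorter than n_bits (including the empty string)."""
    return sum(2 ** length for length in range(n_bits))


for n in (10, 20, 40):
    # Upper bound on how many more-reachable objectives must be ruled out.
    print(n, shorter_descriptions(n))
```

Each additional bit in the description of the base objective roughly doubles the number of shorter candidates that adversarial training would have to distinguish it from.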

Even in such a case, however, there would likely still be a period of time where the learned algorithm is a misaligned mesa-optimizer, leaving open an ominous possibility: the misaligned mesa-optimizer could figure out the correct actions to take based on $O_{\text{base}}$ while its objective function was still $O_{\text{mesa}}$. We will call this situation deceptive alignment and will discuss it at greater length in the next post.

The fourth post in the Mesa-Optimization Sequence is titled “Deceptive Alignment.”


  1. Though a large algorithmic range seems to make approximate alignment less likely, it is unclear how it might affect other forms of pseudo-alignment such as deceptive alignment.