Conditions for Mesa-Optimization

This is the second of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

In this post, we consider how the following two components of a particular machine learning system might influence whether it will produce a mesa-optimizer:

  1. The task: The training distribution and base objective function.

  2. The base optimizer: The machine learning algorithm and model architecture.

We deliberately choose to present theoretical considerations for why mesa-optimization may or may not occur rather than provide concrete examples. Mesa-optimization is a phenomenon that we believe will occur mainly in machine learning systems that are more advanced than those that exist today.[1] Thus, an attempt to induce mesa-optimization in a current machine learning system would likely require us to use an artificial setup specifically designed to induce mesa-optimization. Moreover, the limited interpretability of neural networks, combined with the fact that there is no general and precise definition of “optimizer,” means that it would be hard to evaluate whether a given model is a mesa-optimizer.

2.1. The task

Some tasks benefit from mesa-optimizers more than others. For example, tic-tac-toe can be perfectly solved by simple rules. Thus, a base optimizer has no need to generate a mesa-optimizer to solve tic-tac-toe, since a simple learned algorithm implementing the rules for perfect play will do. Human survival in the savanna, by contrast, did seem to benefit from mesa-optimization. Below, we discuss the properties of tasks that may influence the likelihood of mesa-optimization.

Better generalization through search. We hypothesize that, to consistently achieve a certain level of performance in an environment, there is some minimum amount of optimization power that must be applied to find a policy that performs that well.

To see this, we can think of optimization power as being measured in terms of the number of times the optimizer is able to divide the search space in half—that is, the number of bits of information provided.(9) After these divisions, there will be some remaining space of policies that the optimizer is unable to distinguish between. Then, to ensure that all policies in the remaining space have some minimum level of performance—to provide a performance lower bound[2]—will always require the original space to be divided some minimum number of times—that is, there will always have to be some minimum number of bits of optimization power applied.
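This counting argument can be made concrete with a small sketch (all numbers are hypothetical): if a search space contains S candidate policies and only A of them meet the performance bar, then guaranteeing the bar requires at least log2(S/A) halvings of the space.

```python
import math

def min_bits_required(total_policies: int, acceptable_policies: int) -> float:
    """Lower bound on the optimization power (in bits) needed to guarantee
    that every policy left after selection meets the performance bar: the
    space must be halved until what remains fits inside the acceptable set."""
    return math.log2(total_policies / acceptable_policies)

# Hypothetical numbers: 2**40 candidate policies, 2**10 of which are
# acceptable, gives a 30-bit lower bound on the optimization power needed.
bits = min_bits_required(2**40, 2**10)
```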

However, there are two distinct levels at which this optimization power could be expended: the base optimizer could expend optimization power selecting a highly-tuned learned algorithm, or the learned algorithm could itself expend optimization power selecting highly-tuned actions.

As a mesa-optimizer is just a learned algorithm that itself performs optimization, the degree to which mesa-optimizers will be incentivized in machine learning systems is likely to depend on which of these levels it is more advantageous for the system to perform optimization at. For many current machine learning models, where we expend vastly more computational resources training the model than running it, it seems generally favorable for most of the optimization work to be done by the base optimizer, with the resulting learned algorithm being simply a network of highly-tuned heuristics rather than a mesa-optimizer.

We are already encountering some problems, however—Go, Chess, and Shogi, for example—for which this approach does not scale. Indeed, our best current algorithms for those tasks involve explicitly making an optimizer (hard-coded Monte Carlo tree search with learned heuristics) that does optimization work on the level of the learned algorithm rather than having all the optimization work done by the base optimizer.(10) Arguably, this sort of task is only adequately solvable this way—if it were possible to train a straightforward DQN agent to perform well at Chess, it plausibly would have to learn to internally perform something like a tree search, producing a mesa-optimizer.[3]
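The division of labor between a hard-coded search procedure and learned heuristics can be sketched roughly as follows. This is not the actual algorithm used for Go or Chess, just a toy depth-limited search in which the search structure is fixed and only the leaf evaluation (here a stand-in lambda) would be learned.

```python
def tree_search(state, depth, actions, step, heuristic):
    """Toy depth-limited search: the search loop is hard-coded, while the
    leaf evaluation `heuristic` plays the role of the learned component.
    Returns the best heuristic value reachable from `state`."""
    moves = actions(state)
    if depth == 0 or not moves:
        return heuristic(state)
    return max(tree_search(step(state, a), depth - 1, actions, step, heuristic)
               for a in moves)

# Hypothetical toy domain: states are integers, actions add 1 or 2, and the
# "learned" heuristic prefers states close to 10.
best = tree_search(
    0, 3,
    actions=lambda s: [1, 2],
    step=lambda s, a: s + a,
    heuristic=lambda s: -abs(10 - s),
)
```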

We hypothesize that the attractiveness of search in these domains is due to the diverse, branching nature of these environments. This is because search—that is, optimization—tends to be good at generalizing across diverse environments, as it gets to individually determine the best action for each individual task instance. There is a general distinction along these lines between optimization work done on the level of the learned algorithm and that done on the level of the base optimizer: the learned algorithm only has to determine the best action for a given task instance, whereas the base optimizer has to design heuristics that will hold regardless of what task instance the learned algorithm encounters. Furthermore, a mesa-optimizer can immediately optimize its actions in novel situations, whereas the base optimizer can only change the mesa-optimizer’s policy by modifying it ex-post. Thus, for environments that are diverse enough that most task instances are likely to be completely novel, search allows the mesa-optimizer to adjust for that new task instance immediately.

For example, consider reinforcement learning in a diverse environment, such as one that directly involves interacting with the real world. We can think of a diverse environment as requiring a very large amount of computation to figure out good policies before conditioning on the specifics of an individual instance, but only a much smaller amount of computation to figure out a good policy once the specific instance of the environment is known. We can model this observation as follows.

Suppose an environment is composed of N different instances, each of which requires a completely distinct policy to succeed in.[4] Let P be the optimization power (measured in bits(9)) applied by the base optimizer, which should be approximately proportional to the number of training steps. Then, let x be the optimization power applied by the learned algorithm in each environment instance and f(x) the total amount of optimization power the base optimizer must put in to get a learned algorithm capable of performing that amount of optimization.[5] We will assume that the rest of the base optimizer’s optimization power, P - f(x), goes into tuning the learned algorithm’s policy. Since the base optimizer has to distribute its tuning across all N task instances, the amount of optimization power it will be able to contribute to each instance will be (P - f(x))/N, under the previous assumption that each instance requires a completely distinct policy. On the other hand, since the learned algorithm does all of its optimization at runtime, it can direct all of it into the given task instance, making its contribution to the total for each instance simply x.[6]

Thus, if we assume that, for a given P, the base optimizer will select the value of x that maximizes the minimum level of performance, and thus the total optimization power applied to each instance, we get[7]

x* = argmax_x [ x + (P - f(x))/N ]

As one moves to more and more diverse environments—that is, as N increases—this model suggests that x will dominate (P - f(x))/N, implying that mesa-optimization will become more and more favorable. Of course, this is simply a toy model, as it makes many questionable simplifying assumptions. Nevertheless, it sketches an argument for a pull towards mesa-optimization in sufficiently diverse environments.
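The toy model can be simulated directly. The cost curve f(x) = x² below is a purely hypothetical choice, made only so the trade-off is visible; the qualitative behavior is what the text argues, namely that the optimal x grows as N grows.

```python
def optimal_mesa_optimization(P, N, f, xs):
    """For the toy model in the text: choose the optimization power x done
    by the learned algorithm so as to maximize the total optimization
    applied to each instance, x + (P - f(x)) / N, subject to f(x) <= P."""
    feasible = [x for x in xs if f(x) <= P]
    return max(feasible, key=lambda x: x + (P - f(x)) / N)

# Hypothetical cost curve: producing an x-bit optimizer costs f(x) = x**2
# bits of base optimization.
f = lambda x: x ** 2
xs = range(0, 33)
low_diversity = optimal_mesa_optimization(P=1024, N=2, f=f, xs=xs)
high_diversity = optimal_mesa_optimization(P=1024, N=10**6, f=f, xs=xs)
# As N grows, the optimal x grows: more of the optimization work shifts
# from the base optimizer to the learned algorithm.
```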

As an illustrative example, consider biological evolution. The environment of the real world is highly diverse, resulting in non-optimizer policies directly fine-tuned by evolution—those of plants, for example—having to be very simple, as evolution has to spread its optimization power across a very wide range of possible environment instances. On the other hand, animals with nervous systems can display significantly more complex policies by virtue of being able to perform their own optimization, which can be based on immediate information from their environment. This allows sufficiently advanced mesa-optimizers, such as humans, to massively outperform other species, especially in the face of novel environments, as the optimization performed internally by humans allows them to find good policies even in entirely novel environments.

Compression of complex policies. In some tasks, good performance requires a very complex policy. At the same time, base optimizers are generally biased in favor of selecting learned algorithms with lower complexity. Thus, all else being equal, the base optimizer will generally be incentivized to look for a highly compressed policy.

One way to find a compressed policy is to search for one that is able to use general features of the task structure to produce good behavior, rather than simply memorizing the correct output for each input. A mesa-optimizer is an example of such a policy. From the perspective of the base optimizer, a mesa-optimizer is a highly-compressed version of whatever policy it ends up implementing: instead of explicitly encoding the details of that policy in the learned algorithm, the base optimizer simply needs to encode how to search for such a policy. Furthermore, if a mesa-optimizer can determine the important features of its environment at runtime, it does not need to be given as much prior information as to what those important features are, and can thus be much simpler.

This effect is most pronounced for tasks with a broad diversity of details but common high-level features. For example, Go, Chess, and Shogi have a very large domain of possible board states, but admit a single high-level strategy for play—heuristic-guided tree search—that performs well across all board states.(10) On the other hand, a classifier trained on random noise is unlikely to benefit from compression at all.

The environment need not necessarily be too diverse for this sort of effect to appear, however, as long as the pressure for low description length is strong enough. As a simple illustrative example, consider the following task: given a maze, the learned algorithm must output a path through the maze from start to finish. If the maze is sufficiently long and complicated, then the specific strategy for solving this particular maze—specifying each individual turn—will have a high description length. However, the description length of a general optimization algorithm for finding a path through an arbitrary maze is fairly small. Therefore, if the base optimizer is selecting for programs with low description length, then it might find a mesa-optimizer that can solve all mazes, despite the training environment only containing one maze.
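The point about description length can be illustrated with a general solver: a breadth-first search over a grid maze has a short, fixed description, while a memorized turn-by-turn path grows with the maze. A minimal sketch (the 0/1 grid encoding is our own convention, not anything from the text):

```python
from collections import deque

def solve_maze(maze, start, goal):
    """General maze solver (breadth-first search). Its description length
    is fixed no matter how large the maze is, whereas a memorized path
    grows with the maze. Cells with 0 are open, 1 are walls."""
    rows, cols = len(maze), len(maze[0])
    frontier = deque([(start, [start])])
    seen = {start}
    while frontier:
        (r, c), path = frontier.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append(((nr, nc), path + [(nr, nc)]))
    return None  # no path exists

maze = [[0, 0, 1],
        [1, 0, 1],
        [1, 0, 0]]
path = solve_maze(maze, (0, 0), (2, 2))
```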

Task restriction. The observation that diverse environments seem to increase the probability of mesa-optimization suggests that one way of reducing the probability of mesa-optimizers might be to keep the tasks on which AI systems are trained highly restricted. Focusing on building many individual AI services which can together offer all the capabilities of a generally-intelligent system, rather than a single general-purpose artificial general intelligence (AGI), for example, might be a way to accomplish this while still remaining competitive with other approaches.(11)

Human modeling. Another aspect of the task that might influence the likelihood of mesa-optimization is the presence of humans in the environment.(12) Since humans often act as optimizers, reasoning about humans will likely involve reasoning about optimization. A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself, resulting in a mesa-optimizer. For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.

Thus, tasks that do not benefit from human modeling seem less likely to produce mesa-optimizers than those that do. Furthermore, there are many tasks that might benefit from human modeling that don’t explicitly involve modeling humans—to the extent that the training distribution is generated by humans, for example, modeling humans might enable the generation of a very good prior for that distribution.

2.2. The base optimizer

It is likely that certain features of the base optimizer will influence how likely it is to generate a mesa-optimizer. First, though we largely focus on reinforcement learning in this sequence, RL is not necessarily the only type of machine learning where mesa-optimizers could appear. For example, it seems plausible that mesa-optimizers could appear in generative adversarial networks.

Second, we hypothesize that the details of a machine learning model’s architecture will have a significant effect on its tendency to implement mesa-optimization. For example, a tabular model, which independently learns the correct output for every input, will never be a mesa-optimizer. On the other hand, if a hypothetical base optimizer looks for the program with the shortest source code that solves a task, then it is more plausible that this program will itself be an optimizer.(13) However, for realistic machine learning base optimizers, it is less clear to what extent mesa-optimizers will be selected for. Thus, we discuss below some factors that might influence the likelihood of mesa-optimization one way or the other.

Reachability. There are many kinds of optimization algorithms that a base optimizer could implement. However, almost every training strategy currently used in machine learning uses some form of local search (such as gradient descent or even genetic algorithms). Thus, it seems plausible that the training strategies of more advanced ML systems will also fall into this category. We will call this general class of optimizers that are based on local hill-climbing “local optimization processes.”

We can then formulate a notion of reachability—the difficulty for the base optimizer to find any given learned algorithm—which we can analyze in the case of a local optimization process. A local optimization process might fail to find a particular learned algorithm that would perform very well on the base objective if the learned algorithm is surrounded by other algorithms that perform poorly on the base objective. For a mesa-optimizer to be produced by a local optimization process, it needs to not only perform well on the base objective, but also be reachable; that is, there needs to be a path through the space of learned algorithms to it that is approximately monotonically increasing in performance. The degree to which the path need only be approximate—that is, the degree to which ML training procedures can escape local optima—is likely to be critical, as optimization algorithms are complex enough that a significant portion of the algorithm might need to be present before performance gains start being realized.
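Reachability can be illustrated with a minimal hill climber over a hypothetical one-dimensional "algorithm space": a high-performing point surrounded by poorly-performing neighbors is never found, because no approximately monotonically increasing path leads to it.

```python
def hill_climb(fitness, x0, neighbors):
    """Greedy local search: repeatedly move to the best strictly-better
    neighbor; stop at a local optimum."""
    x = x0
    while True:
        best = max(neighbors(x), key=fitness)
        if fitness(best) <= fitness(x):
            return x
        x = best

def fitness(x):
    # Hypothetical landscape: the global optimum at x == 10 (standing in
    # for a mesa-optimizer) is surrounded by poorly-performing algorithms,
    # while a gentle slope leads to a mediocre local peak at x == 3.
    return 100 if x == 10 else 5 - abs(x - 3)

# Starting from 0, local search stalls at 3; the point at 10 is unreachable.
found = hill_climb(fitness, 0, neighbors=lambda x: [x - 1, x + 1])
```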

Algorithmic range. One key factor likely to determine the reachability of mesa-optimizers is the algorithmic range of the learned algorithms—that is, how extensive the set of algorithms capable of being found by the base optimizer is (equivalently, how expressive the model space is). The more extensive a model’s algorithmic range, the broader the space of possible learned algorithms, and thus the more likely that the base optimizer will be able to find one that is a mesa-optimizer, assuming it is incentivized to do so. For example, architectures that explicitly give the algorithm access to a wide range of possible computations, such as recurrent neural networks or neural Turing machines,(14) seem more likely to produce mesa-optimizers.

Inductive biases. Another important factor is the degree to which the base optimizer is explicitly or implicitly biased in various ways. The nature of these inductive biases will contribute to the likelihood of a mesa-optimizer being selected for. One of the most important kinds of inductive bias is simplicity bias, which would almost certainly be exhibited by almost all base optimizers. We identify three ways in which simplicity bias can manifest itself:

  1. An explicit penalty due to parameter regularization or architectural constraints such as weight-sharing or sparse connections.

  2. An implicit bias due to the model architecture. For example, it has been shown that neural networks are more likely to fit a simple function to a set of training data, even when no regularization is used.(15)

  3. The capacity limitations of the model. The size of a model imposes a hard limit on the complexity of the functions it is able to represent. Thus, to the degree that the base optimizer is selecting based on performance, it will be driven to “squeeze out” as much performance as it can for any given model capacity, leading to a bias in favor of relatively compressed policies.

The more a base optimizer is biased towards simple solutions, the more it will be incentivized to find a compressed policy such as a mesa-optimizer.
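The first manifestation above, an explicit penalty, can be sketched as standard L2 regularization (the weights and coefficient below are hypothetical): among solutions with equal task loss, the one with the smaller parameter norm scores better.

```python
def regularized_loss(task_loss, weights, lam=0.01):
    """Explicit simplicity bias (item 1 above): add an L2 penalty on the
    parameters to the task loss, so that among equally-performing
    solutions the base optimizer prefers the smaller-norm one."""
    return task_loss + lam * sum(w * w for w in weights)

# Two hypothetical solutions with identical task loss: once the penalty is
# included, the lower-norm solution is preferred.
simple = regularized_loss(1.0, [0.1, -0.2])
complex_ = regularized_loss(1.0, [3.0, -4.0])
```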

The base optimizer could also be biased in other ways. For example, it could be biased towards algorithms with a low time or space complexity. If we view mesa-optimization as delegation of optimization work, a time penalty on the learned algorithm is likely to incentivize the base optimizer to do more of the work itself, pre-computing relevant heuristics rather than delegating optimization work to the learned algorithm. Thus, we hypothesize that penalizing description length will favor mesa-optimizers while penalizing time complexity will disfavor them. This suggests the following speculative conjecture: neither a minimal-depth nor a minimal-size boolean circuit that solves a problem can be a mesa-optimizer.(16)

Lastly, another form of bias that might have particularly interesting effects is the presence of an information funnel. In deep learning, the base optimizer directly selects the values of individual parameters. In biological evolution, by contrast, the base optimizer selects DNA strings, which in turn produce the detailed structure of the brain only indirectly. This implicitly introduces pressure towards compressing the brain’s structure. As we noted previously, this might favor the development of mesa-optimizers, as they can be seen as a way of compressing complex policies. Because deep learning algorithms currently select weights directly, there is no such information funnel. In the case of meta-learning over a machine learning model’s architecture, however, an information funnel might present itself due to the fact that the meta-learner has to learn a single policy for creating architectures that is capable of solving all of the problems it is presented with.

Statefulness. Another factor that could make it easier for a learned algorithm to implement a complex optimization process is the ability of the learned algorithm to save and recall information. The ability to refer back to previous state enables the caching of intermediate results, increasing the architecture’s algorithmic range by giving the learned algorithm significantly more time to complete its computations. Statefulness, therefore, favors mesa-optimization by decreasing the implicit penalty on time complexity imposed by enforcing all computations to end when the learned algorithm produces its output. One simple form of this might be a recurrent neural network, which can perform computations with longer time horizons due to its ability to pass intermediate activations across different time steps. An even more extreme version of this might be a network that is able to store state not just across different time steps in the same episode, but also across episodes, enabling it to do more long-term learning, though such an architecture is currently completely hypothetical.
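A minimal sketch of how state extends computation across time steps (the weights are arbitrary constants, not a trained model): the final state of a recurrent cell summarizes the whole input history rather than just the last input.

```python
import math

def rnn_step(state, x, w_state=0.9, w_in=0.5):
    """One step of a minimal recurrent cell: the new state mixes the
    previous state with the current input, so intermediate results
    persist across time steps instead of being discarded."""
    return math.tanh(w_state * state + w_in * x)

def run(inputs, state=0.0):
    # The same cell is applied at every step, carrying state forward.
    for x in inputs:
        state = rnn_step(state, x)
    return state

# The final state depends on the whole input history, not just the last
# input: an early input still influences the cached summary two steps later.
a = run([1.0, 0.0, 0.0])
b = run([0.0, 0.0, 0.0])
```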

Hard-coded optimization. One possible means of alleviating some of these issues might be to include hard-coded optimization where the learned algorithm provides only the objective function and not the optimization algorithm. The stronger the optimization performed explicitly, the less strong the optimization performed implicitly by the learned algorithm needs to be. For example, architectures that explicitly perform optimization that is relevant for the task—such as hard-coded Monte Carlo tree search—might decrease the benefit of mesa-optimizers by reducing the need for optimization other than that which is explicitly programmed into the system.

The third post in the Risks from Learned Optimization Sequence, titled “The Inner Alignment Problem,” can be found here.


  1. As of the date of this post. Note that we do examine some existing machine learning systems that we believe are close to producing mesa-optimization in post 5. ↩︎

  2. It is worth noting that the same argument also holds for achieving an average-case guarantee. ↩︎

  3. Assuming reasonable computational constraints. ↩︎

  4. This definition of N is somewhat vague, as there are multiple different levels at which one can chunk an environment into instances. For example, one environment could always have the same high-level features but completely random low-level features, whereas another could have two different categories of instances that are broadly self-similar but different from each other, in which case it’s unclear which has a larger N. However, one can simply imagine holding N constant for all levels but one and just considering how environment diversity changes on that level. ↩︎

  5. Note that this makes the implicit assumption that the amount of optimization power required to find a mesa-optimizer capable of performing x bits of optimization is independent of N. The justification for this is that optimization is a general algorithm that looks the same regardless of what environment it is applied to, so the amount of optimization required to find an x-bit optimizer should be relatively independent of the environment. That being said, it won’t be completely independent, but as long as the primary difference between environments is how much optimization they need, rather than how hard it is to do optimization, the model presented here should hold. ↩︎

  6. Note, however, that there will be some maximum x simply because the learned algorithm generally only has access to so much computational power. ↩︎

  7. Subject to the constraint that f(x) ≤ P. ↩︎