Risks from Learned Optimization: Introduction

This is the first of five posts in the Risks from Learned Op­ti­miza­tion Se­quence based on the pa­per “Risks from Learned Op­ti­miza­tion in Ad­vanced Ma­chine Learn­ing Sys­tems” by Evan Hub­inger, Chris van Mer­wijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the se­quence cor­re­sponds to a differ­ent sec­tion of the pa­per.

Evan Hub­inger, Chris van Mer­wijk, Vladimir Mikulik, and Joar Skalse con­tributed equally to this se­quence. With spe­cial thanks to Paul Chris­ti­ano, Eric Drexler, Rob Bens­inger, Jan Leike, Ro­hin Shah, William Saun­ders, Buck Sh­legeris, David Dalrym­ple, Abram Dem­ski, Stu­art Arm­strong, Linda Linse­fors, Carl Shul­man, Toby Ord, Kate Woolver­ton, and ev­ery­one else who pro­vided feed­back on ear­lier ver­sions of this se­quence.


The goal of this se­quence is to an­a­lyze the type of learned op­ti­miza­tion that oc­curs when a learned model (such as a neu­ral net­work) is it­self an op­ti­mizer—a situ­a­tion we re­fer to as mesa-op­ti­miza­tion, a ne­ol­o­gism we in­tro­duce in this se­quence. We be­lieve that the pos­si­bil­ity of mesa-op­ti­miza­tion raises two im­por­tant ques­tions for the safety and trans­parency of ad­vanced ma­chine learn­ing sys­tems. First, un­der what cir­cum­stances will learned mod­els be op­ti­miz­ers, in­clud­ing when they should not be? Se­cond, when a learned model is an op­ti­mizer, what will its ob­jec­tive be—how will it differ from the loss func­tion it was trained un­der—and how can it be al­igned?

We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we plan to present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems. Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer to as deceptive alignment, which we posit may present one of the largest (though not necessarily insurmountable) current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning.

Two questions

In ma­chine learn­ing, we do not man­u­ally pro­gram each in­di­vi­d­ual pa­ram­e­ter of our mod­els. In­stead, we spec­ify an ob­jec­tive func­tion that cap­tures what we want the sys­tem to do and a learn­ing al­gorithm to op­ti­mize the sys­tem for that ob­jec­tive. In this post, we pre­sent a frame­work that dis­t­in­guishes what a sys­tem is op­ti­mized to do (its “pur­pose”), from what it op­ti­mizes for (its “goal”), if it op­ti­mizes for any­thing at all. While all AI sys­tems are op­ti­mized for some­thing (have a pur­pose), whether they ac­tu­ally op­ti­mize for any­thing (pur­sue a goal) is non-triv­ial. We will say that a sys­tem is an op­ti­mizer if it is in­ter­nally search­ing through a search space (con­sist­ing of pos­si­ble out­puts, poli­cies, plans, strate­gies, or similar) look­ing for those el­e­ments that score high ac­cord­ing to some ob­jec­tive func­tion that is ex­plic­itly rep­re­sented within the sys­tem. Learn­ing al­gorithms in ma­chine learn­ing are op­ti­miz­ers be­cause they search through a space of pos­si­ble pa­ram­e­ters—e.g. neu­ral net­work weights—and im­prove the pa­ram­e­ters with re­spect to some ob­jec­tive. Plan­ning al­gorithms are also op­ti­miz­ers, since they search through pos­si­ble plans, pick­ing those that do well ac­cord­ing to some ob­jec­tive.

Whether a sys­tem is an op­ti­mizer is a prop­erty of its in­ter­nal struc­ture—what al­gorithm it is phys­i­cally im­ple­ment­ing—and not a prop­erty of its in­put-out­put be­hav­ior. Im­por­tantly, the fact that a sys­tem’s be­hav­ior re­sults in some ob­jec­tive be­ing max­i­mized does not make the sys­tem an op­ti­mizer. For ex­am­ple, a bot­tle cap causes wa­ter to be held in­side the bot­tle, but it is not op­ti­miz­ing for that out­come since it is not run­ning any sort of op­ti­miza­tion al­gorithm.(1) Rather, bot­tle caps have been op­ti­mized to keep wa­ter in place. The op­ti­mizer in this situ­a­tion is the hu­man that de­signed the bot­tle cap by search­ing through the space of pos­si­ble tools for one to suc­cess­fully hold wa­ter in a bot­tle. Similarly, image-clas­sify­ing neu­ral net­works are op­ti­mized to achieve low er­ror in their clas­sifi­ca­tions, but are not, in gen­eral, them­selves perform­ing op­ti­miza­tion.
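The difference between being optimized and being an optimizer can be made concrete with a minimal Python sketch. Everything named here (`objective`, `searching_system`, `table_system`, the inputs) is a hypothetical illustration, not something from the paper. The two systems have identical input-output behavior, but only one runs a search against an explicitly represented objective.

```python
from typing import Dict, Tuple

Candidates = Tuple[int, ...]

def objective(x: int) -> float:
    """Explicitly represented objective (assumed for illustration): prefer values near 7."""
    return -abs(x - 7)

def searching_system(candidates: Candidates) -> int:
    """An optimizer in the sense used here: it searches the candidates at run time
    and returns the one that scores highest on an explicit objective."""
    return max(candidates, key=objective)

# An optimizer outside the system precomputes the answers once,
# much like the human who designed the bottle cap.
INPUTS = [(1, 5, 10), (6, 9, 20), (0, 7, 14)]
LOOKUP: Dict[Candidates, int] = {c: searching_system(c) for c in INPUTS}

def table_system(candidates: Candidates) -> int:
    """Not an optimizer: it was optimized, but at run time it only reads a table.
    No search happens and no objective is represented inside it."""
    return LOOKUP[candidates]

for c in INPUTS:
    print(c, searching_system(c), table_system(c))
```

Observing outputs alone cannot tell these two systems apart on the listed inputs; the difference lies entirely in the algorithm each one implements.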

How­ever, it is also pos­si­ble for a neu­ral net­work to it­self run an op­ti­miza­tion al­gorithm. For ex­am­ple, a neu­ral net­work could run a plan­ning al­gorithm that pre­dicts the out­comes of po­ten­tial plans and searches for those it pre­dicts will re­sult in some de­sired out­come.[1] Such a neu­ral net­work would it­self be an op­ti­mizer be­cause it would be search­ing through the space of pos­si­ble plans ac­cord­ing to some ob­jec­tive func­tion. If such a neu­ral net­work were pro­duced in train­ing, there would be two op­ti­miz­ers: the learn­ing al­gorithm that pro­duced the neu­ral net­work—which we will call the base op­ti­mizer—and the neu­ral net­work it­self—which we will call the mesa-op­ti­mizer.[2]

The pos­si­bil­ity of mesa-op­ti­miz­ers has im­por­tant im­pli­ca­tions for the safety of ad­vanced ma­chine learn­ing sys­tems. When a base op­ti­mizer gen­er­ates a mesa-op­ti­mizer, safety prop­er­ties of the base op­ti­mizer may not trans­fer to the mesa-op­ti­mizer. Thus, we ex­plore two pri­mary ques­tions re­lated to the safety of mesa-op­ti­miz­ers:

  1. Mesa-op­ti­miza­tion: Un­der what cir­cum­stances will learned al­gorithms be op­ti­miz­ers?

  2. In­ner al­ign­ment: When a learned al­gorithm is an op­ti­mizer, what will its ob­jec­tive be, and how can it be al­igned?

Once we have introduced our framework in this post, we will address the first question in the second post, begin addressing the second question in the third post, and finally delve deeper into a specific aspect of the second question in the fourth post.

1.1. Base op­ti­miz­ers and mesa-optimizers

Con­ven­tion­ally, the base op­ti­mizer in a ma­chine learn­ing setup is some sort of gra­di­ent de­scent pro­cess with the goal of cre­at­ing a model de­signed to ac­com­plish some spe­cific task.

Some­times, this pro­cess will also in­volve some de­gree of meta-op­ti­miza­tion wherein a meta-op­ti­mizer is tasked with pro­duc­ing a base op­ti­mizer that is it­self good at op­ti­miz­ing sys­tems to achieve par­tic­u­lar goals. Speci­fi­cally, we will think of a meta-op­ti­mizer as any sys­tem whose task is op­ti­miza­tion. For ex­am­ple, we might de­sign a meta-learn­ing sys­tem to help tune our gra­di­ent de­scent pro­cess.(4) Though the model found by meta-op­ti­miza­tion can be thought of as a kind of learned op­ti­mizer, it is not the form of learned op­ti­miza­tion that we are in­ter­ested in for this se­quence. Rather, we are con­cerned with a differ­ent form of learned op­ti­miza­tion which we call mesa-op­ti­miza­tion.

Mesa-optimization is a conceptual dual of meta-optimization: whereas meta is Greek for "after," mesa is Greek for "within."[3] Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer. Unlike meta-optimization, in which the task itself is optimization, mesa-optimization is task-independent, and simply refers to any situation where the internal structure of the model ends up performing optimization because it is instrumentally useful for solving the given task.

In such a case, we will use base ob­jec­tive to re­fer to what­ever crite­rion the base op­ti­mizer was us­ing to se­lect be­tween differ­ent pos­si­ble sys­tems and mesa-ob­jec­tive to re­fer to what­ever crite­rion the mesa-op­ti­mizer is us­ing to se­lect be­tween differ­ent pos­si­ble out­puts. In re­in­force­ment learn­ing (RL), for ex­am­ple, the base ob­jec­tive is gen­er­ally the ex­pected re­turn. Un­like the base ob­jec­tive, the mesa-ob­jec­tive is not speci­fied di­rectly by the pro­gram­mers. Rather, the mesa-ob­jec­tive is sim­ply what­ever ob­jec­tive was found by the base op­ti­mizer that pro­duced good perfor­mance on the train­ing en­vi­ron­ment. Be­cause the mesa-ob­jec­tive is not speci­fied by the pro­gram­mers, mesa-op­ti­miza­tion opens up the pos­si­bil­ity of a mis­match be­tween the base and mesa- ob­jec­tives, wherein the mesa-ob­jec­tive might seem to perform well on the train­ing en­vi­ron­ment but lead to bad perfor­mance off the train­ing en­vi­ron­ment. We will re­fer to this case as pseudo-al­ign­ment be­low.

There need not always be a mesa-ob­jec­tive since the al­gorithm found by the base op­ti­mizer will not always be perform­ing op­ti­miza­tion. Thus, in the gen­eral case, we will re­fer to the model gen­er­ated by the base op­ti­mizer as a learned al­gorithm, which may or may not be a mesa-op­ti­mizer.

Figure 1.1. The re­la­tion­ship be­tween the base and mesa- op­ti­miz­ers. The base op­ti­mizer op­ti­mizes the learned al­gorithm based on its perfor­mance on the base ob­jec­tive. In or­der to do so, the base op­ti­mizer may have turned this learned al­gorithm into a mesa-op­ti­mizer, in which case the mesa-op­ti­mizer it­self runs an op­ti­miza­tion al­gorithm based on its own mesa-ob­jec­tive. Re­gard­less, it is the learned al­gorithm that di­rectly takes ac­tions based on its in­put.
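The relationship between the base optimizer and a mesa-optimizer can be sketched as two nested searches. This is a toy illustration under stated assumptions, not the paper's formalism: the action set, the parameter `theta`, the training data, and all function names are hypothetical, and exhaustive search over a handful of parameter values stands in for gradient descent.

```python
from typing import Callable, List, Tuple

ACTIONS = [0, 1, 2, 3, 4]  # hypothetical action space

def base_objective(action: int, target: int) -> float:
    """Criterion the base optimizer uses to select between learned algorithms."""
    return -abs(action - target)

def make_learned_algorithm(theta: int) -> Callable[[int], int]:
    """Learned algorithm with one parameter. By construction it is a mesa-optimizer:
    it searches ACTIONS using its own internally represented mesa-objective."""
    def mesa_objective(action: int, observation: int) -> float:
        return -abs(action - (observation + theta))

    def act(observation: int) -> int:
        # Inner search: pick the action that scores best on the mesa-objective.
        return max(ACTIONS, key=lambda a: mesa_objective(a, observation))

    return act

def base_optimizer(data: List[Tuple[int, int]], thetas: List[int]) -> int:
    """Outer search: score each candidate learned algorithm by the base
    objective on the training data and keep the best parameter."""
    def score(theta: int) -> float:
        act = make_learned_algorithm(theta)
        return sum(base_objective(act(obs), target) for obs, target in data)

    return max(thetas, key=score)

training_data = [(obs, obs + 1) for obs in range(4)]  # (observation, desired action)
best_theta = base_optimizer(training_data, thetas=[-2, -1, 0, 1, 2])
learned_algorithm = make_learned_algorithm(best_theta)
print(best_theta, learned_algorithm(2))
```

In this sketch the base optimizer never inspects `mesa_objective` directly. It only sees how the resulting actions score on the base objective, and it is the learned algorithm that actually takes actions.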

Pos­si­ble mi­s­un­der­stand­ing: “mesa-op­ti­mizer” does not mean “sub­sys­tem” or “sub­agent.” In the con­text of deep learn­ing, a mesa-op­ti­mizer is sim­ply a neu­ral net­work that is im­ple­ment­ing some op­ti­miza­tion pro­cess and not some emer­gent sub­agent in­side that neu­ral net­work. Mesa-op­ti­miz­ers are sim­ply a par­tic­u­lar type of al­gorithm that the base op­ti­mizer might find to solve its task. Fur­ther­more, we will gen­er­ally be think­ing of the base op­ti­mizer as a straight­for­ward op­ti­miza­tion al­gorithm, and not as an in­tel­li­gent agent choos­ing to cre­ate a sub­agent.[4]

We dis­t­in­guish the mesa-ob­jec­tive from a re­lated no­tion that we term the be­hav­ioral ob­jec­tive. In­for­mally, the be­hav­ioral ob­jec­tive is the ob­jec­tive which ap­pears to be op­ti­mized by the sys­tem’s be­hav­ior. We can op­er­a­tional­ize the be­hav­ioral ob­jec­tive as the ob­jec­tive re­cov­ered from perfect in­verse re­in­force­ment learn­ing (IRL).[5] This is in con­trast to the mesa-ob­jec­tive, which is the ob­jec­tive ac­tively be­ing used by the mesa-op­ti­mizer in its op­ti­miza­tion al­gorithm.

Ar­guably, any pos­si­ble sys­tem has a be­hav­ioral ob­jec­tive—in­clud­ing bricks and bot­tle caps. How­ever, for non-op­ti­miz­ers, the ap­pro­pri­ate be­hav­ioral ob­jec­tive might just be “1 if the ac­tions taken are those that are in fact taken by the sys­tem and 0 oth­er­wise,”[6] and it is thus nei­ther in­ter­est­ing nor use­ful to know that the sys­tem is act­ing to op­ti­mize this ob­jec­tive. For ex­am­ple, the be­hav­ioral ob­jec­tive “op­ti­mized” by a bot­tle cap is the ob­jec­tive of be­hav­ing like a bot­tle cap.[7] How­ever, if the sys­tem is an op­ti­mizer, then it is more likely that it will have a mean­ingful be­hav­ioral ob­jec­tive. That is, to the de­gree that a mesa-op­ti­mizer’s out­put is sys­tem­at­i­cally se­lected to op­ti­mize its mesa-ob­jec­tive, its be­hav­ior may look more like co­her­ent at­tempts to move the world in a par­tic­u­lar di­rec­tion.[8]
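The degenerate behavioral objective described for non-optimizers can be written down directly. This is a minimal sketch, assuming a deterministic policy represented as a Python function; `trivial_behavioral_objective`, `bottle_cap`, and the state and action labels are hypothetical names introduced for illustration.

```python
from typing import Callable, Hashable

Policy = Callable[[Hashable], Hashable]

def trivial_behavioral_objective(policy: Policy) -> Callable[[Hashable, Hashable], int]:
    """Build the objective '1 if the action is the one the system in fact takes, else 0'."""
    def objective(state: Hashable, action: Hashable) -> int:
        return 1 if action == policy(state) else 0

    return objective

def bottle_cap(state: Hashable) -> str:
    """A non-optimizer: it does the same thing in every state."""
    return "stay_on_bottle"

objective = trivial_behavioral_objective(bottle_cap)
actions = ["stay_on_bottle", "fall_off"]
for state in ["upright", "tilted"]:
    # The system's own action always attains the maximum score, which is
    # why this objective says nothing interesting about the system.
    print(state, [objective(state, a) for a in actions])
```

Any system at all maximizes the objective built from its own behavior in this way, so such an objective carries no predictive information beyond the behavior itself.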

A given mesa-op­ti­mizer’s mesa-ob­jec­tive is de­ter­mined en­tirely by its in­ter­nal work­ings. Once train­ing is finished and a learned al­gorithm is se­lected, its di­rect out­put—e.g. the ac­tions taken by an RL agent—no longer de­pends on the base ob­jec­tive. Thus, it is the mesa-ob­jec­tive, not the base ob­jec­tive, that de­ter­mines a mesa-op­ti­mizer’s be­hav­ioral ob­jec­tive. Of course, to the de­gree that the learned al­gorithm was se­lected on the ba­sis of the base ob­jec­tive, its out­put will score well on the base ob­jec­tive. How­ever, in the case of a dis­tri­bu­tional shift, we should ex­pect a mesa-op­ti­mizer’s be­hav­ior to more ro­bustly op­ti­mize for the mesa-ob­jec­tive since its be­hav­ior is di­rectly com­puted ac­cord­ing to it.

As an ex­am­ple to illus­trate the base/​mesa dis­tinc­tion in a differ­ent do­main, and the pos­si­bil­ity of mis­al­ign­ment be­tween the base and mesa- ob­jec­tives, con­sider biolog­i­cal evolu­tion. To a first ap­prox­i­ma­tion, evolu­tion se­lects or­ganisms ac­cord­ing to the ob­jec­tive func­tion of their in­clu­sive ge­netic fit­ness in some en­vi­ron­ment.[9] Most of these biolog­i­cal or­ganisms—plants, for ex­am­ple—are not “try­ing” to achieve any­thing, but in­stead merely im­ple­ment heuris­tics that have been pre-se­lected by evolu­tion. How­ever, some or­ganisms, such as hu­mans, have be­hav­ior that does not merely con­sist of such heuris­tics but is in­stead also the re­sult of goal-di­rected op­ti­miza­tion al­gorithms im­ple­mented in the brains of these or­ganisms. Be­cause of this, these or­ganisms can perform be­hav­ior that is com­pletely novel from the per­spec­tive of the evolu­tion­ary pro­cess, such as hu­mans build­ing com­put­ers.

How­ever, hu­mans tend not to place ex­plicit value on evolu­tion’s ob­jec­tive, at least in terms of car­ing about their alle­les’ fre­quency in the pop­u­la­tion. The ob­jec­tive func­tion stored in the hu­man brain is not the same as the ob­jec­tive func­tion of evolu­tion. Thus, when hu­mans dis­play novel be­hav­ior op­ti­mized for their own ob­jec­tives, they can perform very poorly ac­cord­ing to evolu­tion’s ob­jec­tive. Mak­ing a de­ci­sion not to have chil­dren is a pos­si­ble ex­am­ple of this. There­fore, we can think of evolu­tion as a base op­ti­mizer that pro­duced brains—mesa-op­ti­miz­ers—which then ac­tu­ally pro­duce or­ganisms’ be­hav­ior—be­hav­ior that is not nec­es­sar­ily al­igned with evolu­tion.

1.2. The in­ner and outer al­ign­ment problems

In “Scal­able agent al­ign­ment via re­ward mod­el­ing,” Leike et al. de­scribe the con­cept of the “re­ward-re­sult gap” as the differ­ence be­tween the (in their case learned) “re­ward model” (what we call the base ob­jec­tive) and the “re­ward func­tion that is re­cov­ered with perfect in­verse re­in­force­ment learn­ing” (what we call the be­hav­ioral ob­jec­tive).(8) That is, the re­ward-re­sult gap is the fact that there can be a differ­ence be­tween what a learned al­gorithm is ob­served to be do­ing and what the pro­gram­mers want it to be do­ing.

The prob­lem posed by mis­al­igned mesa-op­ti­miz­ers is a kind of re­ward-re­sult gap. Speci­fi­cally, it is the gap be­tween the base ob­jec­tive and the mesa-ob­jec­tive (which then causes a gap be­tween the base ob­jec­tive and the be­hav­ioral ob­jec­tive). We will call the prob­lem of elimi­nat­ing the base-mesa ob­jec­tive gap the in­ner al­ign­ment prob­lem, which we will con­trast with the outer al­ign­ment prob­lem of elimi­nat­ing the gap be­tween the base ob­jec­tive and the in­tended goal of the pro­gram­mers. This ter­minol­ogy is mo­ti­vated by the fact that the in­ner al­ign­ment prob­lem is an al­ign­ment prob­lem en­tirely in­ter­nal to the ma­chine learn­ing sys­tem, whereas the outer al­ign­ment prob­lem is an al­ign­ment prob­lem be­tween the sys­tem and the hu­mans out­side of it (speci­fi­cally be­tween the base ob­jec­tive and the pro­gram­mer’s in­ten­tions). In the con­text of ma­chine learn­ing, outer al­ign­ment refers to al­ign­ing the speci­fied loss func­tion with the in­tended goal, whereas in­ner al­ign­ment refers to al­ign­ing the mesa-ob­jec­tive of a mesa-op­ti­mizer with the speci­fied loss func­tion.

It might not be nec­es­sary to solve the in­ner al­ign­ment prob­lem in or­der to pro­duce safe, highly ca­pa­ble AI sys­tems, as it might be pos­si­ble to pre­vent mesa-op­ti­miz­ers from oc­cur­ring in the first place. If mesa-op­ti­miz­ers can­not be re­li­ably pre­vented, how­ever, then some solu­tion to both the outer and in­ner al­ign­ment prob­lems will be nec­es­sary to en­sure that mesa-op­ti­miz­ers are al­igned with the in­tended goal of the pro­gram­mers.

1.3. Ro­bust al­ign­ment vs. pseudo-alignment

Given enough train­ing, a mesa-op­ti­mizer should even­tu­ally be able to pro­duce out­puts that score highly on the base ob­jec­tive on the train­ing dis­tri­bu­tion. Off the train­ing dis­tri­bu­tion, how­ever—and even on the train­ing dis­tri­bu­tion while it is still early in the train­ing pro­cess—the differ­ence could be ar­bi­trar­ily large. We will use the term ro­bustly al­igned to re­fer to mesa-op­ti­miz­ers with mesa-ob­jec­tives that ro­bustly agree with the base ob­jec­tive across dis­tri­bu­tions and the term pseudo-al­igned to re­fer to mesa-op­ti­miz­ers with mesa-ob­jec­tives that agree with the base ob­jec­tive on past train­ing data, but not ro­bustly across pos­si­ble fu­ture data (ei­ther in test­ing, de­ploy­ment, or fur­ther train­ing). For a pseudo-al­igned mesa-op­ti­mizer, there will be en­vi­ron­ments in which the base and mesa- ob­jec­tives di­verge. Pseudo-al­ign­ment, there­fore, pre­sents a po­ten­tially dan­ger­ous ro­bust­ness prob­lem since it opens up the pos­si­bil­ity of a ma­chine learn­ing sys­tem that com­pe­tently takes ac­tions to achieve some­thing other than the in­tended goal when off the train­ing dis­tri­bu­tion. That is, its ca­pa­bil­ities might gen­er­al­ize while its ob­jec­tive does not.

For a toy example of what pseudo-alignment might look like, consider an RL agent trained on a maze navigation task where all the doors during training happen to be red. Let the base objective (reward function) be $O_{\text{base}}$ = (1 if reached a door, 0 otherwise). On the training distribution, this objective is equivalent to $O_{\text{alt}}$ = (1 if reached something red, 0 otherwise). Consider what would happen if an agent, trained to high performance on $O_{\text{base}}$ on this task, were put in an environment where the doors are instead blue, and with some red objects that are not doors. It might generalize on $O_{\text{base}}$, reliably navigating to the blue door in each maze (robust alignment). But it might also generalize on $O_{\text{alt}}$ instead of $O_{\text{base}}$, reliably navigating each maze to reach red objects (pseudo-alignment).[10]
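The maze example can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a real RL setup: a maze is reduced to the list of objects the agent can reach, the agent is a stand-in that searches those objects under whatever mesa-objective it is given, and the names `Thing`, `o_base`, `o_alt`, `agent`, and `base_return` are assumptions made for this sketch. One objective rewards reaching a door; the alternative rewards reaching something red.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Thing:
    kind: str   # e.g. "door" or "box"
    color: str

Maze = List[Thing]  # a maze reduced to the objects the agent can reach

def o_base(thing: Thing) -> int:
    """Base objective: 1 if the agent reached a door, 0 otherwise."""
    return 1 if thing.kind == "door" else 0

def o_alt(thing: Thing) -> int:
    """Alternative objective: 1 if the agent reached something red, 0 otherwise."""
    return 1 if thing.color == "red" else 0

def agent(maze: Maze, mesa_objective: Callable[[Thing], int]) -> Thing:
    """Mesa-optimizer stand-in: search reachable objects for the best one
    under the agent's own mesa-objective."""
    return max(maze, key=mesa_objective)

def base_return(mazes: List[Maze], mesa_objective: Callable[[Thing], int]) -> int:
    """Total reward under the base objective, whatever the agent is optimizing."""
    return sum(o_base(agent(m, mesa_objective)) for m in mazes)

# Training distribution: every door is red, and nothing else is.
train = [[Thing("door", "red"), Thing("box", "green")],
         [Thing("box", "blue"), Thing("door", "red")]]
# Shifted distribution: doors are blue, and some red objects are not doors.
test = [[Thing("door", "blue"), Thing("box", "red")],
        [Thing("box", "red"), Thing("door", "blue")]]

for name, objective in [("o_base", o_base), ("o_alt", o_alt)]:
    print(name, base_return(train, objective), base_return(test, objective))
```

Both mesa-objectives earn full base reward on the training mazes, so training performance alone cannot distinguish them; only on the shifted mazes does the agent optimizing `o_alt` competently head for the wrong objects.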

1.4. Mesa-op­ti­miza­tion as a safety problem

If pseudo-al­igned mesa-op­ti­miz­ers may arise in ad­vanced ML sys­tems, as we will sug­gest, they could pose two crit­i­cal safety prob­lems.

Un­in­tended op­ti­miza­tion. First, the pos­si­bil­ity of mesa-op­ti­miza­tion means that an ad­vanced ML sys­tem could end up im­ple­ment­ing a pow­er­ful op­ti­miza­tion pro­ce­dure even if its pro­gram­mers never in­tended it to do so. This could be dan­ger­ous if such op­ti­miza­tion leads the sys­tem to take ex­tremal ac­tions out­side the scope of its in­tended be­hav­ior in try­ing to max­i­mize its mesa-ob­jec­tive. Of par­tic­u­lar con­cern are op­ti­miz­ers with ob­jec­tive func­tions and op­ti­miza­tion pro­ce­dures that gen­er­al­ize to the real world. The con­di­tions that lead a learn­ing al­gorithm to find mesa-op­ti­miz­ers, how­ever, are very poorly un­der­stood. Know­ing them would al­low us to pre­dict cases where mesa-op­ti­miza­tion is more likely, as well as take mea­sures to dis­cour­age mesa-op­ti­miza­tion from oc­cur­ring in the first place. The sec­ond post will ex­am­ine some fea­tures of ma­chine learn­ing al­gorithms that might in­fluence their like­li­hood of find­ing mesa-op­ti­miz­ers.

In­ner al­ign­ment. Se­cond, even in cases where it is ac­cept­able for a base op­ti­mizer to find a mesa-op­ti­mizer, a mesa-op­ti­mizer might op­ti­mize for some­thing other than the speci­fied re­ward func­tion. In such a case, it could pro­duce bad be­hav­ior even if op­ti­miz­ing the cor­rect re­ward func­tion was known to be safe. This could hap­pen ei­ther dur­ing train­ing—be­fore the mesa-op­ti­mizer gets to the point where it is al­igned over the train­ing dis­tri­bu­tion—or dur­ing test­ing or de­ploy­ment when the sys­tem is off the train­ing dis­tri­bu­tion. The third post will ad­dress some of the differ­ent ways in which a mesa-op­ti­mizer could be se­lected to op­ti­mize for some­thing other than the speci­fied re­ward func­tion, as well as what at­tributes of an ML sys­tem are likely to en­courage this. In the fourth post, we will dis­cuss a pos­si­ble ex­treme in­ner al­ign­ment failure—which we be­lieve pre­sents one of the most dan­ger­ous risks along these lines—wherein a suffi­ciently ca­pa­ble mis­al­igned mesa-op­ti­mizer could learn to be­have as if it were al­igned with­out ac­tu­ally be­ing ro­bustly al­igned. We will call this situ­a­tion de­cep­tive al­ign­ment.

It may be that pseudo-al­igned mesa-op­ti­miz­ers are easy to ad­dress—if there ex­ists a re­li­able method of al­ign­ing them, or of pre­vent­ing base op­ti­miz­ers from find­ing them. How­ever, it may also be that ad­dress­ing mis­al­igned mesa-op­ti­miz­ers is very difficult—the prob­lem is not suffi­ciently well-un­der­stood at this point for us to know. Cer­tainly, cur­rent ML sys­tems do not pro­duce dan­ger­ous mesa-op­ti­miz­ers, though whether fu­ture sys­tems might is un­known. It is in­deed be­cause of these un­knowns that we be­lieve the prob­lem is im­por­tant to an­a­lyze.

The second post in the Risks from Learned Optimization Sequence is titled “Conditions for Mesa-Optimization.”

Glos­sary | Bibliography

  1. As a concrete example of what a neural network optimizer might look like, consider TreeQN.(2) TreeQN, as described in Farquhar et al., is a Q-learning agent that performs model-based planning (via tree search in a latent representation of the environment states) as part of its computation of the Q-function. Though their agent is an optimizer by design, one could imagine a similar algorithm being learned by a DQN agent with a sufficiently expressive approximator for the Q-function. Universal Planning Networks, as described by Srinivas et al.,(3) provide another example of a learned system that performs optimization, though the optimization there is built in, in the form of SGD via automatic differentiation. However, research such as that in Andrychowicz et al.(4) and Duan et al.(5) demonstrates that optimization algorithms can be learned by RNNs, making it possible that a Universal Planning Networks-like agent could be entirely learned, including the internal optimization steps, assuming a very expressive model space. Note that while these examples are taken from reinforcement learning, optimization might in principle take place in any sufficiently expressive learned system. ↩︎

  2. Pre­vi­ous work in this space has of­ten cen­tered around the con­cept of “op­ti­miza­tion dae­mons,”(6) a frame­work that we be­lieve is po­ten­tially mis­lead­ing and hope to sup­plant. Notably, the term “op­ti­miza­tion dae­mon” came out of dis­cus­sions re­gard­ing the na­ture of hu­mans and evolu­tion, and, as a re­sult, car­ries an­thro­po­mor­phic con­no­ta­tions. ↩︎

  3. The word mesa has been pro­posed as the op­po­site of meta.(7) The du­al­ity comes from think­ing of meta-op­ti­miza­tion as one layer above the base op­ti­mizer and mesa-op­ti­miza­tion as one layer be­low. ↩︎

  4. That be­ing said, some of our con­sid­er­a­tions do still ap­ply even in that case. ↩︎

  5. Leike et al.(8) in­tro­duce the con­cept of an ob­jec­tive re­cov­ered from perfect IRL. ↩︎

  6. For the for­mal con­struc­tion of this ob­jec­tive, see pg. 6 in Leike et al.(8) ↩︎

  7. This objective is by definition trivially optimal in any situation that the bottle cap finds itself in. ↩︎

  8. Ul­ti­mately, our worry is op­ti­miza­tion in the di­rec­tion of some co­her­ent but un­safe ob­jec­tive. In this se­quence, we as­sume that search pro­vides suffi­cient struc­ture to ex­pect co­her­ent ob­jec­tives. While we be­lieve this is a rea­son­able as­sump­tion, it is un­clear both whether search is nec­es­sary and whether it is suffi­cient. Fur­ther work ex­am­in­ing this as­sump­tion will likely be needed. ↩︎

  9. The situ­a­tion with evolu­tion is more com­pli­cated than is pre­sented here and we do not ex­pect our anal­ogy to live up to in­tense scrutiny. We pre­sent it as noth­ing more than that: an evoca­tive anal­ogy (and, to some ex­tent, an ex­is­tence proof) that ex­plains the key con­cepts. More care­ful ar­gu­ments are pre­sented later. ↩︎

  10. Of course, it might also fail to gen­er­al­ize at all. ↩︎