[AN #58] Mesa optimization: what it is, and why we should care

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.


Risks from Learned Optimization in Advanced Machine Learning Systems (Evan Hubinger et al): Suppose you search over a space of programs, looking for one that plays TicTacToe well. Initially, you might find some good heuristics, e.g. go for the center square, or if you have two along a row, place the third one. But eventually you might find the minimax algorithm, which plays optimally by searching for the best action to take. Notably, your outer optimization over the space of programs found a program that was itself an optimizer that searches over possible moves. In the language of this paper, the minimax algorithm is a mesa optimizer: an optimizer that is found by a base optimizer, in this case the search over programs.
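To make the example concrete, here is a minimal sketch of the minimax idea for TicTacToe (my own illustration, not code from the paper; the board encoding and function names are assumptions). The point is that this program is itself a search over moves, i.e. the kind of "inner" optimizer an outer search over programs might stumble upon:

```python
# Minimal minimax for TicTacToe. A board is a 9-character string
# ('X', 'O', or ' ') read row by row.

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
    for a, b, c in lines:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (value, move) for `player` under optimal play:
    value is +1 if X wins, -1 if O wins, 0 for a draw."""
    w = winner(board)
    if w:
        return (1 if w == 'X' else -1), None
    moves = [i for i, c in enumerate(board) if c == ' ']
    if not moves:
        return 0, None  # board full: draw
    best = None
    for m in moves:
        value, _ = minimax(board[:m] + player + board[m+1:],
                           'O' if player == 'X' else 'X')
        if best is None or (player == 'X' and value > best[0]) \
                        or (player == 'O' and value < best[0]):
            best = (value, m)
    return best

value, move = minimax(' ' * 9, 'X')  # optimal play from the empty board
```

Note that nothing here is a lookup table of heuristics: the program evaluates candidate moves against an objective, which is exactly what makes it an optimizer in the paper's sense.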

Why is this relevant to AI? Well, gradient descent is an optimization algorithm that searches over the space of neural net parameters to find a set that performs well on some objective. It seems plausible that the same thing could occur: gradient descent could find a model that is itself performing optimization. That model would then be a mesa optimizer, and the objective that it optimizes is the mesa objective. Note that while the mesa objective should lead to similar behavior to the base objective on the training distribution, it need not do so off distribution. This means the mesa objective is pseudo aligned; if it also leads to similar behavior off distribution, it is robustly aligned.
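For contrast with the inner search above, the base optimizer in this story can be sketched in a few lines (a toy 1-D example of my own, assumed for illustration only): gradient descent searches over parameters, not over actions.

```python
# A minimal sketch of gradient descent as the *base* optimizer:
# it searches over parameter values theta to minimize a loss.

def grad_descent(grad, theta=0.0, lr=0.1, steps=100):
    """Repeatedly step `theta` against the gradient of the loss."""
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

# Toy loss (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = grad_descent(lambda t: 2 * (t - 3))
```

The mesa-optimization worry is that the "model" this outer loop finds could itself contain a loop like the minimax search: optimization at two levels, with two potentially different objectives.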

A central worry with AI alignment is that if powerful AI agents optimize the wrong objective, it could lead to catastrophic outcomes for humanity. With the possibility of mesa optimizers, this worry is doubled: we need to ensure both that the base objective is aligned with humans (called outer alignment) and that the mesa objective is aligned with the base objective (called inner alignment). A particularly worrying aspect is deceptive alignment: the mesa optimizer has a long-term mesa objective, but knows that it is being optimized for a base objective. So, it optimizes the base objective during training to avoid being modified, but at deployment, when the threat of modification is gone, it pursues only the mesa objective.

As a motivating example, if someone wanted to create the best biological replicators, they could reasonably have used natural selection / evolution as an optimization algorithm for this goal. However, this would then lead to the creation of humans, who are mesa optimizers that optimize for other goals, and don’t optimize for replication (e.g. by using birth control).

The paper has a lot more detail and analysis of what factors make mesa optimization more likely, more dangerous, etc.; you’ll have to read the paper for all of these details. One general pattern is that, when using machine learning for some task X, there are a bunch of properties that affect the likelihood of learning heuristics or proxies rather than actually learning the optimal algorithm for X. For any such property, making heuristics/proxies more likely results in a lower chance of mesa optimization (since optimizers are less like heuristics/proxies), but conditional on mesa optimization arising, makes it more likely that the mesa optimizer is pseudo aligned rather than robustly aligned (because the pressure for heuristics/proxies now leads to learning a proxy mesa objective instead of the true base objective).

Rohin’s opinion: I’m glad this paper has finally come out. The concepts of mesa optimization and the inner alignment problem seem quite important, and currently I am most worried about x-risk caused by a misaligned mesa optimizer. Unfortunately, it is not yet clear whether mesa optimizers will actually arise in practice, though I think conditional on us developing AGI it is quite likely. Gradient descent is a relatively weak optimizer; it seems like AGI would have to be much more powerful, and so would require a learned optimizer (in the same way that humans can be thought of as “optimizers learned by evolution”).

There is still a lot of confusion and uncertainty around the concept, especially because we don’t have a good definition of “optimization”. It also doesn’t help that it’s hard to get an example of this in an existing ML system: today’s systems are likely not powerful enough to have a mesa optimizer (though even if they had one, we might not be able to tell, because of how uninterpretable the models are).

Read more: Alignment Forum version

Technical AI alignment

Agent foundations

Selection vs Control (Abram Demski): The previous paper focuses on mesa optimizers that are explicitly searching across a space of possibilities for an option that performs well on some objective. This post argues that in addition to this “selection” model of optimization, there is a “control” model of optimization, where the model cannot evaluate all of the options separately (as in e.g. a heat-seeking missile, which can’t try all of the possible paths to the target separately). However, these are not cleanly separated categories; for example, a search process could have control-based optimization inside of it, in the form of heuristics that guide the search towards more likely regions of the search space.

Rohin’s opinion: This is an important distinction, and I’m of the opinion that most of what we call “intelligence” is actually more like the “control” side of these two options.

Learning human intent

Imitation Learning as f-Divergence Minimization (Liyiming Ke et al) (summarized by Cody): This paper frames imitation learning through the lens of matching your model’s distribution over trajectories (or conditional actions) to the distribution of an expert policy. This framing of distribution comparison naturally leads to the discussion of f-divergences, a broad set of measures including the KL and Jensen-Shannon divergences. The paper argues that existing imitation learning methods have implicitly chosen divergence measures that incentivize “mode covering” (making sure to have support anywhere the expert does) rather than “mode collapsing” (making sure to only have support where the expert does), and that the latter is more appropriate for safety reasons, since the average between two modes of an expert policy may not itself be a safe policy. They demonstrate this by using a variational approximation of the reverse KL divergence as the divergence underlying their imitation learner.
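The mode-covering vs mode-seeking distinction can be checked numerically on a toy problem (this is my own sketch, not the paper’s experiment; the bimodal “expert” and Gaussian imitators are assumptions): forward KL(p‖q) heavily penalizes an imitator that misses part of the expert’s support, while reverse KL(q‖p) heavily penalizes an imitator that puts mass where the expert has none, such as between two modes.

```python
import numpy as np

xs = np.linspace(-10, 10, 4001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal "expert" policy distribution with modes at -3 and +3.
p = 0.5 * gauss(xs, -3) + 0.5 * gauss(xs, 3)

def kl(a, b):
    """Discretized KL(a || b) on the grid."""
    eps = 1e-300  # guard against log(0)
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

# Two candidate unimodal imitators: one averaging between the modes
# (the unsafe "average of two safe modes"), one sitting on a single mode.
q_between = gauss(xs, 0.0)
q_on_mode = gauss(xs, 3.0)

forward_between, forward_on = kl(p, q_between), kl(p, q_on_mode)
reverse_between, reverse_on = kl(q_between, p), kl(q_on_mode, p)
# Forward KL prefers the between-modes (covering) imitator;
# reverse KL prefers the single-mode (seeking) imitator.
```

This is the safety intuition in miniature: the reverse-KL imitator stays on one expert mode instead of averaging between them.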

Cody’s opinion: I appreciate papers like these that connect people’s intuitions between different areas (like imitation learning and distributional difference measures). It does seem like this approach would even more strongly lead to an inability to outperform the demonstrator, but that’s honestly more a critique of imitation learning in general than of this paper in particular.

Handling groups of agents

Social Influence as Intrinsic Motivation for Multi-Agent Deep RL (Natasha Jaques et al) (summarized by Cody): An emerging field of common-sum multi-agent research asks how to induce groups of agents to perform complex coordination behavior to increase general reward, and many existing approaches involve centralized training or hardcoding altruistic behavior into the agents. This paper suggests a new technique that rewards agents for having a causal influence over the actions of other agents, in the sense that the actions of a pair of agents have high mutual information. The authors empirically find that having even a small number of agents who act as “influencers” can help avoid coordination failures in partial information settings and lead to higher collective reward. In one sub-experiment, they only add this influence reward to the agents’ communication channels, so agents are incentivized to provide information that will impact other agents’ actions (this information is presumed to be truthful and beneficial, since otherwise it would subsequently be ignored).
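The quantity behind the influence reward is mutual information between one agent’s action and another’s. As a hedged sketch (my own construction, not the paper’s counterfactual-based estimator), here is the textbook computation from an empirical joint action table, showing that independent actions carry zero influence signal while perfectly correlated actions carry log 2 nats:

```python
import numpy as np

def mutual_information(joint):
    """I(A; B) in nats, given joint[i, j] proportional to P(A=i, B=j)."""
    joint = joint / joint.sum()
    pa = joint.sum(axis=1, keepdims=True)   # marginal of A, shape (n, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal of B, shape (1, m)
    mask = joint > 0                        # 0 * log 0 contributes nothing
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))

# Influencee ignores the influencer: no mutual information.
indep = np.outer([0.5, 0.5], [0.5, 0.5])
# Influencee copies the influencer: maximal mutual information, log(2).
corr = np.array([[0.5, 0.0], [0.0, 0.5]])
```

The paper's actual reward is computed per-step from counterfactual action distributions rather than from a batch table like this, but the quantity being maximized is the same in spirit.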

Cody’s opinion: I’m interested by this paper’s finding that you can generate apparently altruistic behavior by incentivizing agents to influence others, rather than necessarily to help others. I also appreciate the effort made to train in a decentralized way. I’d love to see more work on a less asymmetric version of the influence reward; currently influencers and influencees are separate groups due to worries about causal feedback loops, and this implicitly means there’s a constructed group of quasi-altruistic agents who are getting less concrete reward because they’re being incentivized by this auxiliary reward.


ICML Uncertainty and Robustness Workshop Accepted Papers (summarized by Dan H): The Uncertainty and Robustness Workshop accepted papers are available. Topics include out-of-distribution detection, generalization to stochastic corruptions, label corruption robustness, and so on.

Miscellaneous (Alignment)

To first order, moral realism and moral anti-realism are the same thing (Stuart Armstrong)

AI strategy and policy

Grover: A State-of-the-Art Defense against Neural Fake News (Rowan Zellers et al): Could we use ML to detect fake news generated by other ML models? This paper suggests that the models used to generate fake news will also be able to be used to detect that same fake news. In particular, they train a GAN-like language model on news articles, which they dub GROVER, and show that the generated articles are better propaganda than those generated by humans, but that they can at least be detected by GROVER itself.

Notably, they do plan to release their models, so that other researchers can also work on the problem of detecting fake news. They are following a similar release strategy as with GPT-2 (AN #46): they are making the 117M and 345M parameter models public, and releasing their 1.5B parameter model to researchers who sign a release form.

Rohin’s opinion: It’s interesting to see that this group went with a very similar release strategy, and I wish they had written more about why they chose to do what they did. I do like that they are, on the face of it, “cooperating” with OpenAI, but eventually we need norms for how to make publication decisions, rather than always following the precedent set by someone prior. Though I suppose there could be a bit more risk with their models: while they are the same size as the released GPT-2 models, they are better tuned for generating propaganda than GPT-2 is.

Read more: Defending Against Neural Fake News

The Hacker Learns to Trust (Connor Leahy): An independent researcher attempted to replicate GPT-2 (AN #46) and was planning to release the model. However, he has now decided not to release it, because releasing would set a bad precedent. Regardless of whether or not GPT-2 is dangerous, at some point in the future we will develop AI systems that really are dangerous, and we need to have adequate norms then that allow researchers to take their time, evaluate the potential issues, and then make an informed decision about what to do. Key quote: “sending a message that it is ok, even celebrated, for a lone individual to unilaterally go against reasonable safety concerns of other researchers is not a good message to send”.

Rohin’s opinion: I quite strongly agree that the most important impact of the GPT-2 decision was that it started a discussion about what appropriate safety norms should be, whereas before there were no such norms at all. I don’t know whether or not GPT-2 is dangerous, but I am glad that AI researchers have started thinking about whether and how publication norms should change.

Other progress in AI

Reinforcement learning

A Survey of Reinforcement Learning Informed by Natural Language (Jelena Luketina et al) (summarized by Cody): Humans use language as a way of efficiently storing knowledge of the world and instructions for handling new scenarios; this paper is written from the perspective that it would be potentially hugely valuable if RL agents could leverage information stored in language in similar ways. They look at both the case where language is an inherent part of the task (for example, the goal is specified by a language instruction) and the case where language is used to give auxiliary information (for example, parts of the environment are described using language). Overall, the authors push for more work in this area, and in particular for more work using language models pretrained on external corpora, and research designs that use human-generated rather than synthetically generated language; the latter is typically preferred for the sake of speed, but the former has particular challenges we’ll need to tackle to actually use existing sources of human language data.

Cody’s opinion: This article is a solid and useful version of what I would expect out of a review article: it is mostly useful as a way to get thinking in the direction of the intersection of RL and language, and it makes me more interested in digging into some of the mentioned techniques, since by design the review didn’t go very deep into any of them.

Deep learning

the transformer … “explained”? (nostalgebraist) (H/T Daniel Filan): This is an excellent explanation of the intuitions and ideas behind self-attention and the Transformer architecture (AN #44).

Ray Interference: a Source of Plateaus in Deep Reinforcement Learning (Tom Schaul et al) (summarized by Cody): The authors argue that deep RL is subject to a particular kind of training pathology called “ray interference”, caused by situations where (1) there are multiple sub-tasks within a task, and the gradient update for one can decrease performance on the others, and (2) the ability to learn on a given sub-task is a function of its current performance. Performance interference can happen whenever there are shared components between notional subcomponents or subtasks, and the fact that many RL algorithms learn on-policy means that low performance might lead to little data collection in a region of parameter space, making it harder to increase performance there in the future.
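The two ingredients above can be put together in a toy dynamical sketch (my own construction, only loosely inspired by the paper’s toy model; the coupling constants are assumptions): two sub-task performances where learning speed is gated by current performance and each update on one sub-task slightly degrades the other. The weaker sub-task first gets suppressed, producing a plateau, and only recovers once the dominant one saturates.

```python
# Toy "ray interference" dynamics: a and b are performances in [0, 1]
# on two sub-tasks sharing parameters.

def simulate(a=0.9, b=0.01, lr=0.05, interference=0.5, steps=2000):
    history = []
    for _ in range(steps):
        da = a * (1 - a)   # learning on a sub-task is gated by its performance
        db = b * (1 - b)
        # Each sub-task's update partially interferes with the other.
        a += lr * (da - interference * db)
        b += lr * (db - interference * da)
        # Small floor so a suppressed sub-task can eventually recover.
        a = min(max(a, 1e-3), 1.0)
        b = min(max(b, 1e-3), 1.0)
        history.append((a, b))
    return history

history = simulate()
```

In this toy run, b first dips below its starting value and sits on a plateau while a races to 1, then recovers once a's gradient vanishes, which is the qualitative plateau shape the paper describes.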

Cody’s opinion: This seems like a useful mental concept, but the pathology seems quite difficult to effectively remedy, except by preferring off-policy methods to on-policy ones, since there isn’t really a way to decompose real RL tasks into separable components the way the authors do in their toy example.

Meta learning

Alpha MAML: Adaptive Model-Agnostic Meta-Learning (Harkirat Singh Behl et al)