[AN #63] How architecture search, meta learning, and environment design could lead to general intelligence


Find all Align­ment Newslet­ter re­sources here. In par­tic­u­lar, you can sign up, or look through this spread­sheet of all sum­maries that have ever been in the newslet­ter. I’m always happy to hear feed­back; you can send it to me by re­ply­ing to this email.

Au­dio ver­sion here (may not be up yet).


AI-GAs: AI-gen­er­at­ing al­gorithms, an al­ter­nate paradigm for pro­duc­ing gen­eral ar­tifi­cial in­tel­li­gence (Jeff Clune) (sum­ma­rized by Yuxi Liu and Ro­hin): His­tor­i­cally, the bit­ter les­son (AN #49) has been that ap­proaches that lev­er­age in­creas­ing com­pu­ta­tion for learn­ing out­perform ones that build in a lot of knowl­edge. The cur­rent ethos to­wards AGI seems to be that we will come up with a bunch of build­ing blocks (e.g. con­volu­tions, trans­form­ers, trust re­gions, GANs, ac­tive learn­ing, cur­ricula) that we will some­how man­u­ally com­bine into one com­plex pow­er­ful AI sys­tem. Rather than re­quire this man­ual ap­proach, we could in­stead ap­ply learn­ing once more, giv­ing the paradigm of AI-gen­er­at­ing al­gorithms, or AI-GA.

AI-GA has three pillars. The first is to learn ar­chi­tec­tures: this is analo­gous to a su­per­pow­ered neu­ral ar­chi­tec­ture search that can dis­cover con­volu­tions, re­cur­rence and at­ten­tion with­out any hard­cod­ing. The sec­ond is to learn the learn­ing al­gorithms, i.e. meta-learn­ing. The third and most un­der­ex­plored pillar is to learn to gen­er­ate com­plex and di­verse en­vi­ron­ments within which to train our agents. This is a nat­u­ral ex­ten­sion of meta-learn­ing: with meta-learn­ing, you have to spec­ify the dis­tri­bu­tion of tasks the agent should perform well on; AI-GA sim­ply says to learn this dis­tri­bu­tion as well. POET (AN #41) is an ex­am­ple of re­cent work in this area.

A strong reason for optimism about the AI-GA paradigm is that it mimics the way that humans arose: natural selection was a very simple algorithm that, with a lot of compute and a very complex and diverse environment, was able to produce a general intelligence: us. Because it aims to learn everything, it would need fewer building blocks, and so it could succeed faster than the manual approach, at least if the required amount of compute is not too high. It is also much more neglected than the “manual” approach.

How­ever, there are safety con­cerns. Any pow­er­ful AI that comes from an AI-GA will be harder to un­der­stand, since it’s pro­duced by this vast com­pu­ta­tion where ev­ery­thing is learned, and so it would be hard to get an AI that is al­igned with our val­ues. In ad­di­tion, with such a pro­cess it seems more likely that a pow­er­ful AI sys­tem “catches us by sur­prise”—at some point the stars al­ign and the gi­ant com­pu­ta­tion makes one good ran­dom choice and sud­denly it out­puts a very pow­er­ful and sam­ple effi­cient learn­ing al­gorithm (aka an AGI, at least by some defi­ni­tions). There is also the eth­i­cal con­cern that since we’d end up mimick­ing evolu­tion, we might ac­ci­den­tally in­stan­ti­ate large amounts of simu­lated be­ings that can suffer (es­pe­cially if the en­vi­ron­ment is com­pet­i­tive, as was the case with evolu­tion).

Ro­hin’s opinion: Espe­cially given the growth of com­pute (AN #7), this agenda seems like a nat­u­ral one to pur­sue to get AGI. Un­for­tu­nately, it also mir­rors very closely the phe­nomenon of mesa op­ti­miza­tion (AN #58), with the only differ­ence be­ing that it is in­tended that the method pro­duces a pow­er­ful in­ner op­ti­mizer. As the pa­per ac­knowl­edges, this in­tro­duces sev­eral risks, and so it calls for deep en­gage­ment with AI safety re­searchers (but sadly it does not pro­pose ideas on how to miti­gate the risks).

Due to the vast data re­quire­ments, most of the en­vi­ron­ments would have to be simu­lated. I sus­pect that this will make the agenda harder than it may seem at first glance—I think that the com­plex­ity of the real world was quite cru­cial, and that simu­lat­ing en­vi­ron­ments that reach the ap­pro­pri­ate level of com­plex­ity will be a very difficult task. (My in­tu­ition is that some­thing like Neu­ral MMO (AN #48) is nowhere near enough com­plex­ity.)

Tech­ni­cal AI alignment


The “Com­mit­ment Races” prob­lem (Daniel Koko­ta­jlo) (sum­ma­rized by Ro­hin): When two agents are in a com­pet­i­tive game, it is of­ten to each agent’s ad­van­tage to quickly make a cred­ible com­mit­ment be­fore the other can. For ex­am­ple, in Chicken (both play­ers drive a car straight to­wards the other and the first to swerve out of the way loses), an agent could rip out their steer­ing wheel, thus cred­ibly com­mit­ting to driv­ing straight. The first agent to do so would likely win the game. Thus, agents have an in­cen­tive to make com­mit­ments as quickly as pos­si­ble, be­fore their com­peti­tors can make com­mit­ments them­selves. This trades off against the in­cen­tive to think care­fully about com­mit­ments, and may re­sult in ar­bi­trar­ily bad out­comes.
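The incentive described above can be made concrete with a toy version of Chicken. This is a minimal sketch: the payoff numbers and the function names are illustrative assumptions, not taken from the post. It shows that a player who commits first to driving straight wins if the other player best-responds, while both players committing at once gives the worst outcome for both.

```python
# Hypothetical payoffs for Chicken: (row player's payoff, column player's payoff).
# The numbers are illustrative assumptions, not taken from the post.
PAYOFFS = {
    ("swerve", "swerve"): (0, 0),
    ("swerve", "straight"): (-1, 1),
    ("straight", "swerve"): (1, -1),
    ("straight", "straight"): (-10, -10),
}
ACTIONS = ("swerve", "straight")


def best_response(row_action):
    """Column player's best action once the row player's action is fixed."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(row_action, a)][1])


def outcome_after_commitment(committed_action):
    """Row player commits first; column player observes it and best-responds."""
    reply = best_response(committed_action)
    return reply, PAYOFFS[(committed_action, reply)]


def outcome_if_both_commit():
    """Both players rip out their steering wheels before seeing the other do so."""
    return PAYOFFS[("straight", "straight")]


reply, payoffs = outcome_after_commitment("straight")
print("reply to a credible commitment:", reply, "payoffs:", payoffs)
print("payoffs if both commit simultaneously:", outcome_if_both_commit())
```

With these assumed payoffs, committing first pays off only if the other player has not already committed, which is the race the summary describes.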

Iter­ated amplification

Towards a mechanis­tic un­der­stand­ing of cor­rigi­bil­ity (Evan Hub­inger) (sum­ma­rized by Ro­hin): One gen­eral ap­proach to al­ign AI is to train and ver­ify that an AI sys­tem performs ac­cept­ably on all in­puts. How­ever, we can’t do this by sim­ply try­ing out all in­puts, and so for ver­ifi­ca­tion we need to have an ac­cept­abil­ity crite­rion that is a func­tion of the “struc­ture” of the com­pu­ta­tion, as op­posed to just in­put-out­put be­hav­ior. This post in­ves­ti­gates what this might look like if the ac­cept­abil­ity crite­rion is some fla­vor of cor­rigi­bil­ity, for an AI trained via am­plifi­ca­tion.

Agent foundations

Troll Bridge (Abram Dem­ski) (sum­ma­rized by Ro­hin): This is a par­tic­u­larly clean ex­po­si­tion of the Troll Bridge prob­lem in de­ci­sion the­ory. In this prob­lem, an agent is de­ter­min­ing whether to cross a bridge guarded by a troll who will blow up the agent if its rea­son­ing is in­con­sis­tent. It turns out that an agent with con­sis­tent rea­son­ing can prove that if it crosses, it will be de­tected as in­con­sis­tent and blown up, and so it de­cides not to cross. This is rather strange rea­son­ing about coun­ter­fac­tu­als—we’d ex­pect per­haps that the agent is un­cer­tain about whether its rea­son­ing is con­sis­tent or not.

Two senses of “optimizer” (Joar Skalse) (summarized by Rohin): The first sense of “optimizer” is an optimization algorithm that, given some formally specified problem, computes the solution to that problem, e.g. a SAT solver or linear program solver. The second sense is an algorithm that acts upon its environment to change it. Joar believes that people often conflate the two in AI safety.

Ro­hin’s opinion: I agree that this is an im­por­tant dis­tinc­tion to keep in mind. It seems to me that the dis­tinc­tion is whether the op­ti­mizer has knowl­edge about the en­vi­ron­ment: in canon­i­cal ex­am­ples of the first kind of op­ti­mizer, it does not. If we some­how en­coded the dy­nam­ics of the world as a SAT for­mula and asked a su­per-pow­er­ful SAT solver to solve for the ac­tions that ac­com­plish some goal, it would look like the sec­ond kind of op­ti­mizer.

Ad­ver­sar­ial examples

Testing Robustness Against Unforeseen Adversaries (Daniel Kang et al) (summarized by Cody): This paper demonstrates that adversarially training on just one type or family of adversarial distortions fails to provide general robustness against different kinds of possible distortions. In particular, they show that adversarial training against L-p norm ball distortions transfers reasonably well to other L-p norm ball attacks, but provides little value, and can in fact reduce robustness, when evaluated on other families of attacks, such as adversarially-chosen Gabor noise, “snow” noise, or JPEG compression. In addition to proposing these new perturbation types beyond the typical L-p norm ball, the paper also provides a “calibration table” with epsilon sizes they judge to be comparable between attack types, by evaluating them according to how much they reduce accuracy on either a defended or undefended model. (Because attacks are so different in approach, a given numerical value of epsilon won’t correspond to the same “strength” of attack across methods.)
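To illustrate why a given epsilon does not mean the same thing across attack types, here is a small sketch comparing two L-p norm balls. It is not the paper's calibration procedure (which compares attacks by their effect on accuracy); the dimension, epsilon value, and helper names are assumptions chosen for illustration.

```python
import math


def lp_norm(delta, p):
    """L-p norm of a perturbation given as a flat list of floats."""
    if p == math.inf:
        return max(abs(x) for x in delta)
    return sum(abs(x) ** p for x in delta) ** (1.0 / p)


def project_l2(delta, eps):
    """Rescale delta so that its L2 norm is at most eps."""
    norm = lp_norm(delta, 2)
    if norm <= eps:
        return list(delta)
    return [x * eps / norm for x in delta]


# Illustrative assumption: a flattened 32x32 RGB image has 3072 dimensions.
d = 3072
eps = 0.03

# The largest perturbation allowed by an L-infinity ball of radius eps.
delta = [eps] * d
linf_size = lp_norm(delta, math.inf)
l2_size = lp_norm(delta, 2)  # equals eps * sqrt(d), far larger than eps

# Forcing the same perturbation into an L2 ball of radius eps shrinks every pixel.
shrunk = project_l2(delta, eps)

print("L-inf norm:", linf_size)
print("L2 norm of the same perturbation:", l2_size)
print("per-pixel change after L2 projection:", abs(shrunk[0]))
```

The same number eps permits very different perturbations under the two norms, and the gap only widens for attacks such as noise or compression that are not norm balls at all.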

Cody’s opinion: I didn’t personally find this paper hugely surprising, given that the past pattern of whack-a-mole between attack and defense suggests that defenses tend to be limited in scope and don’t confer general robustness. That said, I appreciate how centrally the authors frame this lack of transfer as a problem, and the effort they put into generating new attack types and calibrating them so they can be meaningfully compared to existing L-p norm ball ones.

Ro­hin’s opinion: I see this pa­per as call­ing for ad­ver­sar­ial ex­am­ples re­searchers to stop fo­cus­ing just on the L-p norm ball, in line with one of the re­sponses (AN #62) to the last newslet­ter’s high­light, Ad­ver­sar­ial Ex­am­ples Are Not Bugs, They Are Fea­tures (AN #62).

Read more: Test­ing Ro­bust­ness Against Un­fore­seen Adversaries


An Em­piri­cal Eval­u­a­tion on Ro­bust­ness and Uncer­tainty of Reg­u­lariza­tion Meth­ods (Sanghyuk Chun et al) (sum­ma­rized by Dan H): There are sev­eral small tricks to im­prove clas­sifi­ca­tion perfor­mance such as la­bel smooth­ing, dropout-like reg­u­lariza­tion, mixup, and so on. How­ever, this pa­per shows that many of these tech­niques have mixed and of­ten nega­tive effects on var­i­ous no­tions of ro­bust­ness and un­cer­tainty es­ti­mates.
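For readers unfamiliar with two of the tricks named above, here is a minimal sketch of label smoothing and mixup in their usual textbook form. This is not the paper's code; the function names and the example values are assumptions, and the mixing coefficient is passed in explicitly rather than sampled.

```python
def smooth_labels(one_hot, alpha):
    """Label smoothing: move a fraction alpha of the mass to a uniform distribution."""
    k = len(one_hot)
    return [(1 - alpha) * y + alpha / k for y in one_hot]


def mixup(x1, y1, x2, y2, lam):
    """Mixup: convex combination of two inputs and of their label vectors.

    In practice lam is usually drawn from a Beta distribution; it is an
    explicit argument here to keep the example deterministic.
    """
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y


smoothed = smooth_labels([0.0, 1.0, 0.0], alpha=0.1)
mixed_x, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)

print("smoothed labels:", smoothed)
print("mixed input:", mixed_x, "mixed label:", mixed_y)
```

Both tricks change the training targets away from hard one-hot labels, which is one plausible reason they could affect uncertainty estimates.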

Cri­tiques (Align­ment)

Con­ver­sa­tion with Ernie Davis (Robert Long and Ernie Davis)

Mis­cel­la­neous (Align­ment)

Dis­tance Func­tions are Hard (Grue_Slinky) (sum­ma­rized by Ro­hin): Many ideas in AI al­ign­ment re­quire some sort of dis­tance func­tion. For ex­am­ple, in Func­tional De­ci­sion The­ory, we’d like to know how “similar” two al­gorithms are (which can in­fluence whether or not we think we have “log­i­cal con­trol” over them). This post ar­gues that defin­ing such dis­tance func­tions is hard, be­cause they rely on hu­man con­cepts that are not eas­ily for­mal­iz­able, and the in­tu­itive math­e­mat­i­cal for­mal­iza­tions usu­ally have some flaw.

Ro­hin’s opinion: I cer­tainly agree that defin­ing “con­cep­tual” dis­tance func­tions is hard. It has similar prob­lems to say­ing “write down a util­ity func­tion that cap­tures hu­man val­ues”—it’s pos­si­ble in the­ory but in prac­tice we’re not go­ing to think of all the edge cases. How­ever, it seems pos­si­ble to learn dis­tance func­tions rather than defin­ing them; this is already done in per­cep­tion and state es­ti­ma­tion.

AI Align­ment Pod­cast: On Con­scious­ness, Qualia, and Mean­ing (Lu­cas Perry, Mike John­son and An­drés Gómez Emils­son)

AI strat­egy and policy

Soft take­off can still lead to de­ci­sive strate­gic ad­van­tage (Daniel Koko­ta­jlo) (sum­ma­rized by Ro­hin): Since there will be an im­proved ver­sion of this post soon, I will sum­ma­rize it then.

FLI Pod­cast: Beyond the Arms Race Nar­ra­tive: AI & China (Ariel Conn, He­len Toner and Elsa Ka­nia)

Re­duc­ing mal­i­cious use of syn­thetic me­dia re­search: Con­sid­er­a­tions and po­ten­tial re­lease prac­tices for ma­chine learn­ing (Aviv Ovadya et al)

Other progress in AI

Re­in­force­ment learning

Are Deep Policy Gra­di­ent Al­gorithms Truly Policy Gra­di­ent Al­gorithms? (An­drew Ilyas et al) (sum­ma­rized by Cody) (H/​T Lawrence Chan): This pa­per in­ves­ti­gates whether and to what ex­tent the stated con­cep­tual jus­tifi­ca­tions for com­mon Policy Gra­di­ent al­gorithms are ac­tu­ally the things driv­ing their suc­cess. The pa­per has two pri­mary strains of em­piri­cal in­ves­ti­ga­tion.

In the first, they ex­am­ine a few of the more rigor­ously the­o­rized as­pects of policy gra­di­ent meth­ods: learned value func­tions as baselines for ad­van­tage calcu­la­tions, sur­ro­gate re­wards, and en­force­ment of a “trust re­gion” where the KL di­ver­gence be­tween old and up­dated policy is bounded in some way. For value func­tions and sur­ro­gate re­wards, the au­thors find that both of these ap­prox­i­ma­tions are weak and perform poorly rel­a­tive to the true value func­tion and re­ward land­scape re­spec­tively.

Basically, it turns out that we lose a lot by approximating in this context. When it comes to enforcing a trust region, they show that TRPO is able to enforce a bound on mean KL, but that this bound is much looser than the (more theoretically justified) bound on max KL that would be ideal but is hard to calculate. PPO is even stranger: they find that it enforces a mean KL bound only when it includes optimizations that are present in the canonical implementation but are not part of the core definition of the algorithm. These optimizations include a custom weight initialization scheme, learning rate annealing on Adam, and reward values that are normalized according to a rolling sum. All of these optimizations contribute to non-trivial increases in performance over the base algorithm, in addition to apparently being central to how PPO maintains its trust region.
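The difference between a mean KL bound and a max KL bound can be seen in a toy tabular example. This is only a sketch: the two policies, the four states, and the threshold delta are made-up assumptions, and real implementations estimate these quantities from sampled states rather than enumerating them.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions, assuming q is nonzero wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_and_max_kl(old_policy, new_policy):
    """Per-state KL(old || new), summarized by its mean and its max over states."""
    kls = [kl_divergence(p, q) for p, q in zip(old_policy, new_policy)]
    return sum(kls) / len(kls), max(kls)


# Hypothetical policies over 4 states and 2 actions.
old_policy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# The update leaves three states untouched and changes one state drastically.
new_policy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.95, 0.05]]

delta = 0.25  # illustrative trust region size
mean_kl, max_kl = mean_and_max_kl(old_policy, new_policy)

print("mean KL:", mean_kl, "within bound:", mean_kl <= delta)
print("max KL:", max_kl, "within bound:", max_kl <= delta)
```

In this example the update satisfies the mean KL constraint while violating the max KL constraint in one state, which is the sense in which a mean KL bound is looser.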

Cody’s opinion: This paper seems like one that will make RL researchers usefully uncomfortable, by pointing out that, given the complexity of our implementations, having a theoretical story for your algorithm’s performance and empirical validation of that heightened performance isn’t enough to confirm that the theory is the thing driving the performance. I do think the authors were a bit overly critical at points: I don’t think anyone working in RL would have expected that the learned value function was perfect, or that gradient updates were un-noisy. But it’s a good reminder that saying things like “value functions as a baseline decrease variance” should be grounded in an empirical examination of how good they are at it, rather than just a theoretical argument that they should.

Learning to Learn with Probabilistic Task Embeddings (Kate Rakelly, Aurick Zhou et al) (summarized by Cody): This paper proposes a solution to off-policy meta reinforcement learning, an appealing problem because on-policy RL is so sample-intensive, and meta-RL is even worse because it needs to solve a distribution over RL problems. The authors’ approach divides the problem into two subproblems: inferring an embedding, z, of the current task given context, and learning an optimal policy and Q-function conditioned on that task embedding. At the beginning of each task, z is sampled from the (Gaussian) prior, and as the agent gains more samples of that particular task, it updates its posterior over z, which can be thought of as refining its guess as to which task it’s been dropped into this time. The trick here is that this subdividing of the problem allows it to be done mostly off-policy, because you only need to use on-policy learning for the task inference component (predicting z given current task transitions), and can learn the Actor-Critic model conditioned on z with off-policy data. The method works by alternating between these two learning modes.
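One way to picture “updating the posterior over z” is a one-dimensional Gaussian update. This sketch assumes that each observed transition contributes an independent Gaussian factor over z, which the summary does not state, and the factor means and variances below are made-up numbers rather than outputs of a learned encoder.

```python
import random


def product_of_gaussians(factors):
    """Combine 1-D Gaussian factors (mean, variance) into one normalized Gaussian.

    Precisions (inverse variances) add, and the mean is precision-weighted.
    """
    precision = sum(1.0 / var for _, var in factors)
    mean = sum(mu / var for mu, var in factors) / precision
    return mean, 1.0 / precision


prior = (0.0, 1.0)  # standard Gaussian prior over z

# Hypothetical per-transition factors (mean, variance) for the current task.
context_factors = [(2.1, 0.5), (1.8, 0.5), (2.0, 0.5)]

posterior = prior
trace = [posterior]
for factor in context_factors:
    # Each new transition narrows the belief about which task this is.
    posterior = product_of_gaussians([posterior, factor])
    trace.append(posterior)

# Posterior sampling: draw one z and condition the policy on it for the next episode.
rng = random.Random(0)
z = rng.gauss(posterior[0], posterior[1] ** 0.5)

for step, (mean, var) in enumerate(trace):
    print("after", step, "transitions: mean", round(mean, 3), "variance", round(var, 3))
```

The variance shrinks with every transition, matching the description of the agent refining its guess about the task as it gathers context.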

Cody’s opinion: I en­joyed this; it’s a well-writ­ten pa­per that uses a few core in­ter­est­ing ideas (pos­te­rior sam­pling over a task dis­tri­bu­tion, rep­re­sen­ta­tion of a task dis­tri­bu­tion as a dis­tri­bu­tion of em­bed­ding vec­tors passed in to con­di­tion Q func­tions), and builds them up to make a method that achieves some im­pres­sive em­piri­cal re­sults.

Read more: Effi­cient Off-Policy Meta-RL via Prob­a­bil­is­tic Con­text Variables