[AN #65]: Learning useful skills by watching humans “play”

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).


Learning Latent Plans from Play (Corey Lynch et al) (summarized by Cody): This paper collects unsupervised data of humans playing with robotic control systems, and uses that data to thread a needle between two problems in learning. One problem is that per-task demonstration data is costly, especially as the number of tasks grows; the other is that randomly sampled control actions will rarely stumble across complex motor tasks in ways that allow robots to learn. The authors argue that human play data is a good compromise because humans at play tend to explore different ways of manipulating objects in ways that give robots nuggets of useful information like “how do I move this block inside a drawer”, which can be composed into more complicated and intentional tasks.

The model works by learning to produce vectors that represent plans (or sequences of actions), and jointly learning to decode those vectors into action sequences. This architecture learns to generate plan vectors by using an autoencoder-like structure that uses KL divergence to align (1) a distribution of plan vectors predicted from the start and end state of a window of play data, and (2) a distribution of plan vectors predicted by looking back at all the actions taken in that window. Because we’re jointly learning to unroll the (2) lookback-summarized vector such that it matches the actions actually taken, we’ll ideally end up with a system that can take in a given plan vector and produce a sequence of actions to execute that plan. And, because we’re learning to predict a vector that aligns with actions successfully taken to get to an end state from a starting one, the model at test time should be able to produce a plan vector corresponding to feasible actions that will get it from its current state to a goal state we’d like it to reach. The authors found that their play-trained model was able to outperform single-task models on a range of manipulation tasks, even though those single-task models were trained with explicit demonstrations of the task.
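To make the KL-alignment objective concrete, here is a minimal sketch (not the paper’s code; the encoder outputs, latent dimension, and variable names are all assumptions for illustration) of the term that pulls the endpoint-conditioned plan distribution toward the action-conditioned one, using the closed-form KL divergence between diagonal Gaussians:

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Hypothetical stand-ins for the outputs of the two plan encoders:
# - "prior": plan distribution predicted from (start, goal) states only
# - "recognition": plan distribution predicted from the full action window
rng = np.random.default_rng(0)
latent_dim = 8
mu_prior, logvar_prior = rng.normal(size=latent_dim), np.zeros(latent_dim)
mu_recog, logvar_recog = rng.normal(size=latent_dim), np.zeros(latent_dim)

# This KL term aligns the two distributions; the full training loss would
# add a reconstruction term for decoding the plan vector back into actions.
kl = kl_diag_gaussians(mu_recog, logvar_recog, mu_prior, logvar_prior)
assert kl >= 0.0
```

At test time, only the endpoint-conditioned encoder is needed: given the current state and a goal state, it proposes a plan vector for the decoder to unroll.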

Cody’s opinion: I really liked this paper: it was creative in combining conceptual components from variational methods and imitation learning, and it was pragmatic in trying to address the problem of how to get viable human-demonstration data in a way that avoids having to get distinct datasets for a huge set of different discrete tasks.

Technical AI alignment

Iterated amplification

Aligning a toy model of optimization (Paul Christiano) (summarized by Rohin): Current ML capabilities are centered around local search: we get a gradient (or an approximation to one, as with evolutionary algorithms), and take a step in that direction to find a new model. Iterated amplification takes advantage of this fact: rather than a sequence of gradient steps on a fixed reward, we can do a sequence of amplification steps and distillation gradient steps.

However, we can consider an even simpler model of ML capabilities: function maximization. Given a function from n-bit strings to real numbers, we model ML as allowing us to find the input n-bit string with the maximum output value, in only O(n) time (rather than the O(2^n) time that brute force search would take). If this were all we knew about ML capabilities, could we still design an aligned, competitive version of it? While this is not the actual problem we face, due to its simplicity it is more amenable to theoretical analysis, and so is worth thinking about.
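The Opt abstraction can be sketched in a few lines. Note the hedge: the post stipulates Opt runs in O(n); the stand-in below brute-forces in O(2^n), which is fine for illustrating the interface at small n:

```python
from itertools import product

def opt(f, n):
    """Toy stand-in for the Opt oracle: return the n-bit string maximizing f.
    The post models Opt as costing O(n); this brute-force version takes
    O(2^n) and only serves to illustrate the interface."""
    return max(product((0, 1), repeat=n), key=f)

# Example objective: the number of 1-bits, uniquely maximized by all-ones.
best = opt(lambda bits: sum(bits), 4)
assert best == (1, 1, 1, 1)
```

The unaligned construction in the next paragraph uses exactly two calls to this interface: one to fit a world model, one to find a policy that scores well against that model.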

We could make an unaligned AI that maximizes some explicit reward using only 2 calls to Opt: first, use Opt to find a good world model M that can predict the dynamics and reward, and then use Opt to find a policy that does well when interacting with M. This is unaligned for all the usual reasons: most obviously, it will try to seize control of the reward channel.

An aligned version does need to use Opt, since that’s the only way of turning a naively-exponential search into a linear one; without using Opt the resulting system won’t be competitive. We can’t just generalize iterated amplification to this case, since iterated amplification relies on a sequence of applications of ML capabilities: this would lead to an aligned AI that uses Opt many times, which will not be competitive since the unaligned AI only requires 2 calls to Opt.

One possible approach is to design an AI with good incentives (in the same way that iterated amplification aims to approximate HCH (AN #34)) that “knows everything that the unaligned AI knows”. However, it would also be useful to produce a proof of impossibility: this would tell us something about what a solution must look like in more complex settings.

Rohin’s opinion: Amusingly, I liked this post primarily because comparing this setting to the typical setting for iterated amplification was useful for seeing the design choices and intuitions that motivated iterated amplification.


Coordination Surveys: why we should survey to organize responsibilities, not just predictions (Andrew Critch) (summarized by Rohin): This post suggests that when surveying researchers about the future impact of their technology, we should specifically ask them about their beliefs about what actions other people will take, and what they personally are going to do, rather than just predicting total impact. (For example, we could ask how many people will invest in safety.) Then, by aggregating across survey respondents, we can see whether or not the researchers’ beliefs about what others will do match the empirical distribution of what researchers are planning to do. This can help mitigate the effect where everyone thinks that everyone else will deal with a problem, and the effect where everyone tries to solve a problem because they all think no one else is planning to solve it. Critch has offered to provide suggestions on including this methodology in any upcoming surveys; see the post for details.
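The consistency check at the heart of the proposal is just a comparison of two aggregates. A minimal sketch (the survey fields, numbers, and threshold are invented for illustration, not taken from the post):

```python
# Hypothetical survey: each respondent predicts what fraction of the field
# will invest in safety, and separately reports whether they themselves plan to.
responses = [
    {"predicted_fraction_investing": 0.6, "plans_to_invest": False},
    {"predicted_fraction_investing": 0.7, "plans_to_invest": False},
    {"predicted_fraction_investing": 0.5, "plans_to_invest": True},
    {"predicted_fraction_investing": 0.8, "plans_to_invest": False},
]

mean_prediction = sum(r["predicted_fraction_investing"] for r in responses) / len(responses)
empirical_fraction = sum(r["plans_to_invest"] for r in responses) / len(responses)

# A large positive gap is the "everyone expects someone else to handle it"
# failure mode; a large negative gap is the over-duplication failure mode.
gap = mean_prediction - empirical_fraction
assert gap > 0.3  # here: 0.65 predicted vs 0.25 actually planning
```

No argument needs to be made to respondents beyond showing them this gap, which is what makes the method appealing.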

Rohin’s opinion: This is a cool idea, and seems worth doing to me. I especially like that the survey would simply reveal problems by collecting two sources of information from people and checking their consistency with each other: there isn’t any particular argument being made; you are simply showing inconsistency in people’s own beliefs to them, if and only if such inconsistency exists. In practice, I’m sure there will be complications (for example, perhaps the set of researchers taking the survey is different from the set of “others” whose actions and beliefs they are predicting), but it still seems worth at least trying out.

AI Forecasting Dictionary (Jacob Lagerros and Ben Goldhaber) (summarized by Rohin): One big challenge with forecasting the future is operationalizing key terms unambiguously, so that a question can be resolved when the future actually arrives. Since we’ll probably need to forecast many different questions, it’s crucial that we make it as easy as possible to create and answer well-operationalized questions. To that end, the authors have created and open-sourced an AI Forecasting Dictionary, which gives precise meanings for important terms, along with examples and non-examples to clarify further.

AI Forecasting Resolution Council (Jacob Lagerros and Ben Goldhaber) (summarized by Rohin): Even if you operationalize forecasting questions well, often the outcome is determined primarily by factors other than the one you are interested in. For example, progress on a benchmark might be determined more by the number of researchers who try to beat the benchmark than by improvements in AI capabilities, even though you were trying to measure the latter. To deal with this problem, an AI Forecasting Resolution Council has been set up: now, forecasters can predict what the resolution council will say at some particular time in the future. This allows for questions that get at what we want: in the previous case, we could now forecast how the resolution council will answer the question “would current methods be able to beat this benchmark” in 2021.

How to write good AI forecasting questions + Question Database (Jacob Lagerros and Ben Goldhaber) (summarized by Rohin): As discussed above, operationalization of forecasting questions is hard. This post collects some of the common failure modes, and introduces a database of 76 questions about AI progress that have detailed resolution criteria that will hopefully avoid any pitfalls of operationalization.

Miscellaneous (Alignment)

The strategy-stealing assumption (Paul Christiano) (summarized by Rohin): We often talk about aligning AIs in a way that is competitive with unaligned AIs. However, you might think that we need them to be better: after all, unaligned AIs only have to pursue one particular goal, whereas aligned AIs have to deal with the fact that we don’t yet know what we want. We might hope that regardless of what goal the unaligned AI has, any strategy it uses to achieve that goal can be turned into a strategy for acquiring flexible influence (i.e. influence useful for many goals). In that case, as long as we control a majority of resources, we can use any strategies that the unaligned AIs can use. For example, if we control 99% of the resources and the unaligned AI controls 1%, then at the very least we can split up into 99 “coalitions” that each control 1% of resources and use the same strategy as the unaligned AI to acquire flexible influence, and this should lead to us obtaining 99% of the resources in expectation. In practice, we could do even better, e.g. by coordinating to shut down any unaligned AI systems.
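The arithmetic behind the coalition argument is a simple symmetry claim, sketched below (a toy illustration of the expectation calculation in the paragraph above, not a model of any real dynamics):

```python
# If 100 coalitions each start with 1% of resources and all play the same
# strategy, then by symmetry each ends with 1% in expectation. Controlling
# 99 of those identical coalitions therefore yields 99% in expectation.
n_coalitions = 100
expected_share_per_coalition = 1 / n_coalitions  # symmetry among identical strategies
our_expected_share = 99 * expected_share_per_coalition
assert abs(our_expected_share - 0.99) < 1e-9
```

The substance of the assumption is in whether the symmetry actually holds, i.e. whether the unaligned AI’s strategy really is stealable, which is what the listed failure modes attack.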

The premise that we can use the same strategy as the unaligned AI, despite the fact that we need flexible influence, is called the strategy-stealing assumption. Solving the alignment problem is critical to strategy-stealing: otherwise, unaligned AI would have an advantage at thinking that we could not steal, and the strategy-stealing assumption would break down. This post discusses ten other ways that the strategy-stealing assumption could fail. For example, the unaligned AI could pursue a strategy that involves threatening to kill humans, and we might not be able to use a similar strategy in response because the unaligned AI might not be as fragile as we are.

Rohin’s opinion: It does seem to me that if we’re in a situation where we have solved the alignment problem, we control 99% of resources, and we aren’t infighting amongst each other, we will likely continue to control at least 99% of the resources in the future. I’m a little confused about how we get to this situation though: the scenarios I usually worry about are the ones in which we fail to solve the alignment problem, but still deploy unaligned AIs, and in these scenarios I’d expect unaligned AIs to get the majority of the resources. I suppose in a multipolar setting with continuous takeoff, if we have mostly solved the alignment problem but still accidentally create unaligned AIs (or some malicious actors create them deliberately), then this setting where we control 99% of the resources could arise.

Other progress in AI


Making Efficient Use of Demonstrations to Solve Hard Exploration Problems (Caglar Gulcehre, Tom Le Paine et al) (summarized by Cody): This paper combines ideas from existing techniques to construct an architecture (R2D3) capable of learning to solve hard exploration problems with a small number (N~100) of demonstrations. R2D3 has two primary architectural features: its use of a recurrent head to learn Q values, and its strategy of sampling trajectories from separate pools of agent and demonstrator experience, with sampling prioritized by highest-temporal-difference-error transitions within each pool.
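The two-pool sampling strategy can be sketched as follows. This is a simplified stand-in, not the paper’s implementation: the demo ratio, the per-transition TD-error weighting, and all names are assumptions for illustration (real prioritized replay also uses importance weights and segment trees, omitted here):

```python
import random

def sample_batch(demo_pool, agent_pool, demo_ratio, batch_size, rng=random.Random(0)):
    """Assemble a training batch R2D3-style: each slot is drawn from the
    demonstration pool with probability demo_ratio, else from agent
    experience; within each pool, transitions are sampled with probability
    proportional to their TD error (a stand-in for prioritized replay)."""
    def weighted_draw(pool):
        total = sum(t["td_error"] for t in pool)
        r = rng.uniform(0, total)
        for t in pool:
            r -= t["td_error"]
            if r <= 0:
                return t
        return pool[-1]

    return [
        weighted_draw(demo_pool if rng.random() < demo_ratio else agent_pool)
        for _ in range(batch_size)
    ]

demo_pool = [{"source": "demo", "td_error": e} for e in (0.5, 1.0, 2.0)]
agent_pool = [{"source": "agent", "td_error": e} for e in (0.1, 0.3, 4.0)]
batch = sample_batch(demo_pool, agent_pool, demo_ratio=0.25, batch_size=8)
assert len(batch) == 8
```

Keeping the pools separate means a small demonstration set is never drowned out by the much larger stream of agent experience.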

As the authors note, this approach is essentially an extension of an earlier paper, Deep Q-Learning from Demonstrations, to use a recurrent head rather than a feed-forward one, allowing it to be more effectively deployed on partial-information environments. The authors test on 8 different environments that require long sequences of task completion to receive any reward, and find that their approach is able to reach human-level performance on four of the tasks, while their baseline comparisons essentially never succeed on any task. Leveraging demonstrations can be valuable for solving these kinds of difficult exploration tasks, because demonstrator trajectories provide examples of how to achieve reward in a setting where the trajectories of a randomly exploring agent would rarely ever reach the end of the task to find positive reward.

Cody’s opinion: For all that this paper’s technique is a fairly straightforward merging of existing techniques (separately-prioritized demonstration and agent pools, and the off-policy SotA R2D2), its results are surprisingly impressive: the tasks tested on require long and complex chains of correct actions that would be challenging for a non-imitation-based system to discover, and have high levels of environment stochasticity that make a pure imitation approach difficult.

Reinforcement learning

Emergent Tool Use from Multi-Agent Interaction (Bowen Baker et al) (summarized by Rohin): We have such a vast diversity of organisms and behaviors on Earth because of evolution: every time a new strategy evolved, it created new pressures and incentives for other organisms, leading to new behaviors. The multiagent competition led to an autocurriculum. This work harnesses this effect: they design a multiagent environment and task, and then use standard RL algorithms to learn several interesting behaviors. Their task is hide-and-seek, where the agents are able to move boxes, walls and ramps, and lock objects in place. The agents find six different strategies, each emerging from incentives created by the previous strategy: seekers chasing hiders, hiders building shelters, seekers using ramps to get into shelters, hiders locking ramps away from seekers, seekers surfing boxes to hiders, and hiders locking both boxes and ramps.

The hope is that this can be used to learn general skills that can then be used for specific tasks. This makes it a form of unsupervised learning, with a similar goal as e.g. curiosity (AN #20). We might hope that multiagent autocurricula would do better than curiosity, because they automatically tend to use features that are important for control in the environment (such as ramps and boxes), while intrinsic motivation methods often end up focusing on features we wouldn’t think are particularly important. They empirically test this by designing five tasks in the environment and checking whether finetuning the agents from the multiagent autocurricula learns faster than direct training and finetuning curiosity-based agents. They find that the multiagent autocurricula agents do best, but only slightly. To explain this, they hypothesize that the learned skill representations are still highly entangled and so are hard to finetune, whereas learned feature representations transfer more easily.

Rohin’s opinion: This is somewhat similar to AI-GAs (AN #63): both depend on environment design, which so far has been relatively neglected. However, AI-GAs are hoping to create learning algorithms, while multiagent autocurricula leads to tool use, at least in this case. Another point of similarity is that they both require vast amounts of compute, as discovering new strategies can take significant exploration. That said, it seems that we might be able to drastically decrease the amount of compute needed by solving the exploration problem using e.g. human play data or demonstrations (discussed in two different papers above).

More speculatively, I hypothesize that it will be useful to have environments where you need to identify what strategy your opponent is using. In this environment, each strategy has the property that it beats all of the strategies that preceded it. As a result, it was fine for the agent to undergo catastrophic forgetting: even though it was trained against past agents, it only needed to learn the current strategy well; it didn’t need to remember previous strategies. Consequently, it may have forgotten prior strategies and skills, which might have reduced its ability to learn new tasks quickly.

Read more: Paper: Emergent Tool Use from Multi-Agent Autocurricula, Vox: Watch an AI learn to play hide-and-seek


Tackling Climate Change with Machine Learning (David Rolnick et al) (summarized by Rohin): See Import AI.
