Alignment Newsletter #30

Link post


Learning Complex Goals with Iterated Amplification (Paul Christiano et al): This blog post and the accompanying paper introduce iterated amplification, focusing on how it can be used to define a training signal for tasks that humans cannot perform or evaluate, such as designing a transit system. The key insight is that humans are capable of decomposing even very difficult tasks into slightly simpler tasks. So, in theory, we could provide ground truth labels for an arbitrarily difficult task via a huge tree of humans, each decomposing their own subquestion and handing off new subquestions to other humans, until the questions become easy enough for a human to answer directly.

We can turn this into an efficient algorithm by having the human decompose the question only once, and using the current AI system to answer the generated subquestions. If the AI isn't able to answer the subquestions, then the human will get nonsense answers. However, as long as there are questions that the human + AI system can answer but the AI alone cannot, the AI can learn from the answers to those questions. To reduce the reliance on human data, another model is trained to predict the decompositions that the human performs. In addition, some tasks could refer to a large context (e.g. evaluating safety for a specific rocket design), so they model the human as being able to access small pieces of the context at a time.
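The decompose-answer-recombine loop can be sketched in a few lines. This is a toy illustration, not the paper's setup: the "human" decomposition function, the lookup-table "model", and the summation task are all stand-ins I've invented for clarity.

```python
def human_decompose(question):
    """Stand-in for a human: to sum a tuple of numbers, either answer
    directly (a single number) or split the tuple in half, with the
    sub-sums to be added back together."""
    if len(question) == 1:
        return [], lambda _: question[0]          # directly answerable
    mid = len(question) // 2
    subquestions = [question[:mid], question[mid:]]
    return subquestions, lambda answers: sum(answers)

def amplify(question, model):
    """One amplification step: the human decomposes the question once,
    the current model answers the subquestions, the human recombines."""
    subquestions, recombine = human_decompose(question)
    return recombine([model(q) for q in subquestions])

# Distillation: repeatedly train the "model" (here a lookup table) to
# imitate the amplified human + model system. Answers to easy questions
# propagate upward to harder ones over successive passes.
table = {}
model = lambda q: table.get(q, 0)  # untrained model gives nonsense answers
for _ in range(3):
    for q in [(1,), (2,), (3,), (4,), (1, 2), (3, 4), (1, 2, 3, 4)]:
        table[q] = amplify(q, model)

print(model((1, 2, 3, 4)))  # 10, learned without any ground-truth label for the full question
```

The point of the sketch is that the model only ever needs supervision from the human + model system, never from a ground-truth oracle for the hardest questions.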

They evaluate on simple algorithmic tasks like computing the distance between nodes in a graph, where they can program an automated "human" decomposition for faster experiments, and where a ground truth solution exists. They compare against supervised learning, which trains a model directly on the ground truth answers (which iterated amplification does not have access to), and find that they can match the performance of supervised learning with only slightly more training steps.

Rohin's opinion: This is my new favorite post/paper for explaining how iterated amplification works, since it very succinctly and clearly makes the case for iterated amplification as a strategy for generating a good training signal. I'd recommend reading the paper in full, as it makes other important points that I haven't included in the summary.

Note that it does not explain a lot of Paul's thinking. It explains one particular training method that allows you to train an AI system with a more intelligent and informed overseer.

Relational inductive biases, deep learning, and graph networks (Peter W. Battaglia et al) (summarized by Richard): "Part position paper, part review, and part unification", this paper emphasises the importance of combinatorial generalisation, which is key to how humans understand the world. It argues for approaches which perform computation over discrete entities and the relations between them, such as graph networks. The authors claim that CNNs and RNNs are so successful because of their relational inductive biases (for example, the bias towards local structure induced by convolutional layers). Graph networks are promising because they can express arbitrary relational biases: any nodes can be connected to any others depending on the structure of the problem. Further, since graph networks learn functions which are reused for all nodes and edges, each network can be applied to graphs of any shape and size: a form of combinatorial generalisation.

In this paper's framework, each 'graph block' does computation over an input graph and returns an output graph. The relevant part of the output might be the values of edges, or those of nodes, or 'global' properties of the overall graph. Graph blocks can be implemented by standard neural network architectures or more unusual ones such as message-passing neural networks or non-local neural networks. The authors note some major open questions: how to generate the graphs in the first place, and how to adaptively modify them during the course of computation.
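The edge/node/global update scheme can be made concrete with a minimal graph block. This is an illustrative sketch, not the paper's implementation: the `phi_*` update functions stand in for learned neural networks, and I use plain sums as the aggregation functions.

```python
def graph_block(nodes, edges, glob, phi_e, phi_v, phi_u):
    """A minimal graph block: update edges from their endpoints, update
    nodes from their aggregated incoming edges, then update the global
    attribute. nodes: {id: value}, edges: {(sender, receiver): value}."""
    # 1. Edge update: each edge sees its own value plus both endpoints.
    new_edges = {(s, r): phi_e(e, nodes[s], nodes[r], glob)
                 for (s, r), e in edges.items()}
    # 2. Node update: each node aggregates (sums) its incoming edge values.
    new_nodes = {}
    for v, h in nodes.items():
        incoming = sum(e for (s, r), e in new_edges.items() if r == v)
        new_nodes[v] = phi_v(h, incoming, glob)
    # 3. Global update: aggregate all nodes and all edges.
    new_glob = phi_u(sum(new_nodes.values()), sum(new_edges.values()), glob)
    return new_nodes, new_edges, new_glob

# A two-node graph with a single edge from node 0 to node 1.
nodes, edges, glob = graph_block(
    {0: 1.0, 1: 2.0}, {(0, 1): 0.5}, 0.0,
    phi_e=lambda e, hs, hr, u: e + hs + hr,   # edge absorbs its endpoints
    phi_v=lambda h, agg, u: h + agg,          # node absorbs incoming edges
    phi_u=lambda n_sum, e_sum, u: u + n_sum + e_sum,
)
print(nodes, edges, glob)  # {0: 1.0, 1: 5.5} {(0, 1): 3.5} 10.0
```

Because the same `phi_*` functions are applied at every edge and node, the block works unchanged on graphs of any shape and size, which is the combinatorial generalisation the paper emphasises.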

Richard's opinion: This paper is an excellent holistic discussion of graph networks and reasons to think they are promising. I'm glad that it also mentioned the open problems, though, since I think they're pretty crucial to using graphs in deep learning, and current approaches in this area (e.g. capsule networks' dynamic control flow) aren't satisfactory.

Technical AI alignment

Iterated amplification

Learning Complex Goals with Iterated Amplification (Paul Christiano et al): Summarized in the highlights!

Agent foundations

When EDT=CDT, ADT Does Well (Diffractor)

Learning human intent

One-Shot Observation Learning (Leo Pauly et al)

Preventing bad behavior

Safe Reinforcement Learning with Model Uncertainty Estimates (Björn Lütjens et al)

Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence. (Ryan Carey)


Learning from Untrusted Data (Charikar, Steinhardt, and Valiant) (summarized by Dan H): This paper introduces semi-verified learning. Here a model learns from a verified or trusted dataset, and from an untrusted dataset which consists of a mixture of legitimate and arbitrary examples. For the untrusted dataset, it is not known which points are legitimate and which are not. This scenario can occur when data is scraped from the internet, recorded by unreliable devices, or gathered through crowdsourcing. Concretely, if a (possibly small) fraction of the scraped data is hand-labeled, then this could count as the trusted set, and the remaining data could be considered the untrusted set. This differs from semi-supervised learning, where there are labeled and unlabeled task-relevant examples. Here there are trusted examples and examples which are untrusted (e.g., labels may be wrong, features may be out-of-distribution, examples may be malicious, and so on). See the full paper for theorems and an algorithm applicable to tasks such as robust density estimation.
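To make the setting concrete, here is a naive filtering baseline for semi-verified mean estimation. This is NOT the paper's algorithm (which comes with formal guarantees); it is only a sketch of how a small trusted set can anchor learning from a corrupted untrusted set, with invented names and thresholds.

```python
def semi_verified_mean(trusted, untrusted, radius=2.0):
    """Naive baseline for the semi-verified setting: use the small trusted
    sample to anchor an estimate, discard untrusted points far from the
    anchor, then re-estimate on everything that survives."""
    anchor = sum(trusted) / len(trusted)
    spread = (sum((x - anchor) ** 2 for x in trusted) / len(trusted)) ** 0.5
    kept = [x for x in untrusted
            if abs(x - anchor) <= radius * max(spread, 1e-9)]
    pooled = trusted + kept
    return sum(pooled) / len(pooled)

trusted = [0.9, 1.1]                    # small hand-verified sample
untrusted = [1.0, 1.2, 0.8, 100.0]      # mostly legitimate, one arbitrary point
print(semi_verified_mean(trusted, untrusted))  # 1.0 (the outlier is discarded)
```

This naive filter fails against adversarial corruptions placed just inside the threshold, which is exactly the regime where the paper's more careful algorithms and theorems are needed.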

Dan H's opinion: The semi-verified model seems highly useful for various safety-related scenarios, including learning with label corruption, poisoned input data, and minimal supervision.


Do Deep Generative Models Know What They Don't Know? (Eric Nalisnick et al)

Read more: Section 4.3 of this paper makes similar observations and ameliorates the issue. This paper also demonstrates the fragility of density estimators on out-of-distribution data.


Thoughts on short timelines (Tobias Baumann): This post argues that the probability of AGI in the next ten years is very low, perhaps 1-2%. The primary argument is that to get AGI that quickly, we would need to be seeing research breakthroughs frequently, and empirically this is not the case. This might not be true if we expect progress to accelerate in the future, but there's no reason to expect this: we won't get recursive self-improvement before AGI, and there won't be a huge increase in resources devoted to AI (since there is already so much excitement). We might also say that we are so clueless that we should assign at least 10% to AGI in ten years, but it doesn't seem that we are that ignorant, and in any case it's not obvious that a prior should assign 10% to this outcome. Expert surveys assign non-negligible probability to AGI in ten years, but in practice the predominant opinion seems to be to confidently dismiss a short-timelines scenario.

Rohin's opinion: I do think that the probability of AGI in ten years is larger than 1-2%. I suspect my main disagreement is with the conception of what counts as groundbreaking progress. Tobias gives the example of transfer from one board game to many other board games; I think that AGI wouldn't be able to solve this problem from scratch, and humans are only capable of it because of good priors from all the other learning we've done throughout life, especially since games are designed to be human-understandable. If you make a sufficiently large neural net and give it a complex enough environment, some simple unsupervised learning rewards, and the opportunity to collect as much data as a human gets throughout life, maybe that does result in AGI. (I'd guess not, because it does seem like we have some good priors from birth, but I'm not very confident in that.)

Other progress in AI


Curiosity and Procrastination in Reinforcement Learning (Nikolay Savinov and Timothy Lillicrap): This blog post explains Episodic Curiosity through Reachability, discussed in AN #28. As a reminder, this method trains a neural net to predict whether two observations were close in time to each other. Recent observations are stored in memory, and the agent is rewarded for reaching states that are predicted to be far away from any observations in memory.
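The reward computation described above can be sketched as follows. This is a simplified illustration of the idea, not the paper's implementation: `reachable` stands in for the trained comparator network, and the threshold, bonus, and memory policy are all invented for the example.

```python
def curiosity_bonus(observation, memory, reachable, threshold=0.5, bonus=1.0):
    """Sketch of the episodic-curiosity reward: reachable(a, b) scores how
    likely b is to be within a few steps of a (a trained network in the
    paper); the agent is rewarded only for observations predicted to be
    far from everything stored in memory."""
    if not memory:
        memory.append(observation)
        return bonus
    # Similarity to memory = max reachability over stored observations.
    similarity = max(reachable(m, observation) for m in memory)
    if similarity < threshold:          # novel: far from everything seen
        memory.append(observation)      # store it and pay the bonus
        return bonus
    return 0.0                          # familiar: no reward

# Toy check with a hand-written reachability function on 1-D observations:
# two observations are "reachable" if they are at most 1 apart.
near = lambda a, b: 1.0 if abs(a - b) <= 1 else 0.0
memory = []
print(curiosity_bonus(0, memory, near))  # 1.0: first observation is always novel
print(curiosity_bonus(1, memory, near))  # 0.0: reachable from 0, no reward
print(curiosity_bonus(5, memory, near))  # 1.0: far from everything in memory
```

Rewarding only hard-to-reach states, rather than hard-to-predict ones, is what avoids the "procrastination" failure mode (e.g. staring at a noisy TV) that the post's title refers to.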

Rohin's opinion: This is easier to read than the paper and more informative than our summaries, so I'd recommend it if you're interested in the paper.

Successor Uncertainties: exploration and uncertainty in temporal difference learning (David Janz et al)

Deep learning

Relational inductive biases, deep learning, and graph networks (Peter W. Battaglia et al): Summarized in the highlights!

Relational recurrent neural networks (Adam Santoro, Ryan Faulkner, David Raposo et al) (summarized by Richard): This paper introduces the Relational Memory Core, which allows interactions between memories stored in memory-based neural networks. It does so using a "self-attention mechanism": each memory updates its contents by attending to all other memories via several "attention heads" which focus on different features. This leads to particularly good performance on the nth-farthest task, which requires ranking the pairwise distances between a set of vectors (91% accuracy, compared with a 30% baseline), and on the Mini-Pacman task.
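The self-attention step over memory slots can be sketched in pure Python. This is a single-head, illustrative version: the paper uses several heads and learned projection matrices, which `w_q`/`w_k`/`w_v` merely stand in for here.

```python
import math

def attend(memories, w_q, w_k, w_v):
    """One self-attention pass over memory slots: each memory forms a
    query, scores every memory's key, and updates its contents as a
    softmax-weighted sum of the values. w_q/w_k/w_v stand in for learned
    projections; here each maps a memory vector to a projected vector."""
    queries = [w_q(m) for m in memories]
    keys = [w_k(m) for m in memories]
    values = [w_v(m) for m in memories]
    d = len(keys[0])
    updated = []
    for q in queries:
        # Scaled dot-product scores against every key, then softmax.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        exps = [math.exp(s - max(scores)) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # New memory = weighted mixture of all values.
        updated.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return updated

identity = lambda m: m                  # stand-in for learned projections
out = attend([[1.0, 0.0], [0.0, 1.0]], identity, identity, identity)
# Each updated memory is a convex mixture of all memories, weighted
# towards itself: out[0][0] > out[0][1], and each row sums to 1 here.
```

Since every memory attends to every other, one pass already lets any pair of memories interact, which is what makes tasks like nth-farthest tractable for this architecture.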

Richard's opinion: While performance is good on small problems, comparing every memory to every other doesn't scale well (a concern the authors also mention in their discussion). It remains to be seen how pruning older memories affects performance.

Relational Deep Reinforcement Learning (Vinicius Zambaldi, David Raposo, Adam Santoro et al) (summarized by Richard): This paper uses the self-attention mechanism discussed in 'Relational recurrent neural networks' to compute relationships between entities extracted from input data. The system was tested on the Box-World environment, in which an agent needs to use keys to open boxes in a certain order. It generalised very well to test environments which required much longer sequences of actions than any training examples, and improved slightly on a baseline for Starcraft mini-games.

Richard's opinion: Getting neural networks to generalise to longer versions of training problems is often surprisingly difficult, so I'm impressed by the Box-World results; I would have liked to see what happened on even longer problems.

Relational inductive bias for physical construction in humans and machines (Jessica B. Hamrick, Kelsey R. Allen et al)


Applying Deep Learning To Airbnb Search (Malay Haldar)

Machine learning

Fluid Annotation: An Exploratory Machine Learning–Powered Interface for Faster Image Annotation (Jasper Uijlings and Vittorio Ferrari): This post describes a system that can be used to help humans label images to generate labels for segmentation. The post summarizes it well: "Fluid Annotation starts from the output of a strong semantic segmentation model, which a human annotator can modify through machine-assisted edit operations using a natural user interface. Our interface empowers annotators to choose what to correct and in which order, allowing them to effectively focus their efforts on what the machine does not already know."

Rohin's opinion: I'm excited about techniques like this that allow us to scale up AI systems with less human effort, by focusing human effort on the aspects of the problem that AI cannot yet solve, while using existing AI systems to do the low-level work (generating a shortlist of potential segmentations, in this case). This is an example of the paradigm of using AI to help humans more effectively create better AI, which is one of the key ideas underlying iterated amplification. (Though iterated amplification focuses on how to use existing AI systems to allow the human to provide a training signal for tasks that humans cannot perform or evaluate themselves.)