[AN #56] Should ML researchers stop running experiments before making hypotheses?

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Highlights

HARK Side of Deep Learning—From Grad Student Descent to Automated Machine Learning (Oguzhan Gencoglu et al): This paper focuses on the negative effects of Hypothesizing After the Results are Known (HARKing), a pattern in which researchers first conduct experiments and view the results, and once the results clear the bar for publishability, construct a hypothesis after the fact to explain them. It argues that HARKing is common in machine learning, and that this has negative effects on the field as a whole. First, improvements to state-of-the-art (SotA) may be questionable because they could have been caused by sufficient hyperparameter tuning via grad student descent, instead of the new idea in the paper to which the gain is attributed. Second, there is publication bias since only positive results are reported at conferences, which prevents us from learning from negative results. Third, hypotheses that are tailored to fit results on a single dataset or task are much less likely to generalize to new datasets or tasks. Fourth, while AutoML systems achieve good results, we cannot figure out what makes them work because the high compute requirements make ablation studies much harder to perform. Finally, the authors argue that we need to fix HARKing in order to achieve things like ethical AI, human-centric AI, reproducible AI, etc.

Rohin’s opinion: I believe that I found this paper the very first time I looked for generic new interesting papers after I started thinking about this problem, which was quite the coincidence. I’m really happy that the authors wrote the paper—it’s not in their incentives (as far as I can tell), but the topic seems crucial to address.

That said, I disagree with the paper on a few counts. The authors don’t acknowledge the value of HARKing—often it is useful to run many experiments and see what happens in order to develop a good theory. Humans are not ideal Bayesian reasoners who can consider all hypotheses at once; we often require many observations in order to even hypothesize a theory. The authors make the point that in other fields HARKing leads to bad results, but ML is significantly different in that we can run experiments much faster, which gives a much higher iteration speed.

If we were instead forced to preregister studies, as the authors suggest, the iteration speed would drop by an order of magnitude or two; I seriously doubt that the benefits would outweigh the cost of lower iteration speed. Instead of preregistering all experiments, maybe researchers could run experiments and observe results, formulate a theory, and then preregister an experiment that would test the theory—but in this case I would expect that researchers end up “preregistering” experiments that are very similar to the experiments that generated the theory, such that the results are very likely to come out in support of the theory.

(This does not require any active malice on the part of the researchers—it’s natural to think of predictions of the theory in the domain where you developed the theory. For example, in our recent paper (AN #45), we explicitly designed four environments where we expected our method to work and one where it wouldn’t.)

Another point: I think that the underlying cause of HARKing is the incentive to chase SotA, and if I were writing this paper I would focus on that. For example, I believe that the bias towards SotA-chasing causes HARKing, and not the other way around. (I’m not sure if the authors believe otherwise; the paper isn’t very clear on this point.) This is also a more direct explanation of results being caused by grad student descent or hyperparameter tuning; the HARKing in such papers occurs because it isn’t acceptable to say “we obtained this result via grad student descent”, since that would not be a contribution to the field.

Although I’ve been cri­tiquing the pa­per, over­all I find my be­liefs much closer to the au­thors’ than the “be­liefs of the field”. (Not the be­liefs of re­searchers in the field: I sus­pect many re­searchers would agree that HARKing has nega­tive effects, even though the in­cen­tives force re­searchers to do so in or­der to get pa­pers pub­lished.) I’d be in­ter­ested in ex­plor­ing the topic fur­ther, but don’t have enough time to do so my­self—if you’re in­ter­ested in build­ing toy mod­els of the re­search field and mod­el­ing the effect of in­ter­ven­tions on the field, re­ply to this email and we can see if it would make sense to col­lab­o­rate.

Technical AI alignment

Problems

Agency Failure AI Apocalypse? (Robin Hanson): This is a response to More realistic tales of doom (AN #50), arguing that the scenarios described in the post are unrealistic given what we know about principal-agent problems. In a typical principal-agent problem, the principal doesn’t know everything about the agent, and the agent can use this fact to gain “agency rents” where it can gain extra value for itself, or there could be an “agency failure” where the principal doesn’t get as much as they want. For example, an employee might spend half of their day browsing the web, because their manager can’t tell that that’s what they are doing. Our economic literature on principal-agent problems suggests that agency problems get harder with more information asymmetry, more noise in outcomes, etc., but not with smarter agents, and in any case we typically see limited agency rents and failures. So it’s unlikely that the case of AI will be any different, and while it’s good to have a couple of people keeping an eye on the problem, it’s not worth the large investment of resources from future-oriented people that we currently see.

Rohin’s opinion: I have a bunch of complicated thoughts on this post, many of which were said in Paul’s comment reply to the post, but I’ll say a few things. Firstly, I think that if you want to view the AI alignment problem in the context of the principal-agent literature, the natural way to think about it is with the principal being less rational than the agent. I claim that it is at least conceivable that an AI system could make humans worse off, but the standard principal-agent model cannot accommodate such a scenario because it assumes the principal is rational, which means the principal always does at least as well as not ceding any control to the agent at all. More importantly, although I’m not too familiar with the principal-agent literature, I’m guessing that it assumes the presence of norms, laws and institutions that constrain both the principal and the agent. In such cases it makes sense that the loss the principal could incur would be bounded, but it’s not obvious that this would hold for sufficiently powerful AI systems.

Learning human intent

Batch Active Preference-Based Learning of Reward Functions (Erdem Bıyık et al) (summarized by Cody): This paper builds on a trend of recent papers that try to learn human preferences, not through demonstrations of optimal behavior, but through a human expressing a preference over two possible trajectories, which has both pragmatic advantages (regarding the limits of human optimality) and theoretical ones (better ability to extrapolate a reward function). Here, the task is framed as: we want to send humans batches of paired trajectories to rank, but which ones? Batch learning is preferable to single-sample active learning because it’s more efficient to update a network after a batch of human judgments, rather than after each single one. This adds complexity to the problem because you’d prefer not to have a batch of samples that are individually high-expected-information but redundant with one another. The authors define an information criterion (basically, the examples about which we’re most uncertain of the human’s judgment) and then pick a batch of examples based on different heuristics for getting a set of trajectories with high information content that are separated from each other in feature space.
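
To make the selection criterion concrete, here is a minimal sketch in Python of one way such a batch could be chosen: score candidate trajectory pairs by how uncertain the current reward model is about the human’s preference, then greedily keep pairs whose feature differences are well separated. The Bradley-Terry preference model, the function names, and the greedy separation heuristic are illustrative assumptions, not the paper’s exact algorithm.

```python
import numpy as np

def preference_uncertainty(reward_weights, feat_a, feat_b):
    """Bradley-Terry style preference model: uncertainty is highest (1.0)
    when the reward model thinks the human is at 50/50 between A and B."""
    p_a = 1.0 / (1.0 + np.exp(-(feat_a - feat_b) @ reward_weights))
    return 1.0 - abs(2.0 * p_a - 1.0)

def select_batch(candidate_pairs, reward_weights, batch_size=10, min_dist=0.5):
    """Greedily pick the most uncertain pairs whose feature differences are
    mutually separated in Euclidean distance (one possible diversity
    heuristic, not necessarily the paper's exact criterion)."""
    diffs = [fa - fb for fa, fb in candidate_pairs]
    scores = [preference_uncertainty(reward_weights, fa, fb)
              for fa, fb in candidate_pairs]
    batch, batch_diffs = [], []
    for i in np.argsort(scores)[::-1]:  # most uncertain candidates first
        if all(np.linalg.norm(diffs[i] - d) >= min_dist for d in batch_diffs):
            batch.append(candidate_pairs[i])
            batch_diffs.append(diffs[i])
        if len(batch) == batch_size:
            break
    return batch

# Example usage with random trajectory features:
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(100)]
batch = select_batch(pairs, reward_weights=rng.normal(size=4))
```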

Cody’s opinion: This is an el­e­gant pa­per that makes good use of the toolkit of ac­tive learn­ing for hu­man prefer­ence so­lic­i­ta­tion, but it’s batch heuris­tics are all very re­li­ant on hav­ing a set of high level tra­jec­tory fea­tures in which Eu­clidean dis­tance be­tween points is a mean­ingful similar­ity met­ric, which feels like a not im­pos­si­ble to gen­er­al­ize but still some­what limit­ing con­straint.

Prerequisites: Active Preference-Based Learning of Reward Functions (Recon #5)

Training human models is an unsolved problem (Charlie Steiner)

Other progress in AI

Reinforcement learning

NeurIPS 2019 Competition: The MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors (William H. Guss et al): In this challenge, which is slated to start on June 1, competitors will try to build agents that obtain a diamond in Minecraft, without using too much environment interaction. This is an incredibly difficult task: in order to make it feasible, the competition also provides a large amount of human demonstrations. They also have a list of simpler tasks that will likely be prerequisites to obtaining a diamond, such as navigating, chopping trees, obtaining an iron pickaxe, and obtaining cooked meat, for which they also collect demonstrations of human gameplay. As the name suggests, the authors hope that the competition will spur researchers into embedding human priors into general algorithms in order to get sample efficient learning.

Rohin’s opinion: I really like the potential of Minecraft as a deep RL research environment, and I’m glad that there’s finally a benchmark/competition that takes advantage of Minecraft being very open world and hierarchical. The tasks that they define are very challenging; there are ways in which it is harder than Dota (no self-play curriculum, learning from pixels instead of states, more explicit hierarchy) and ways in which it is easier (slightly shorter episodes, smaller action space, don’t have to be adaptive based on opponents). Of course, the hope is that with demonstrations of human gameplay, it will not be necessary to use as much compute as was necessary to solve Dota (AN #54).

I also like the emphasis on how to leverage human priors within general learning algorithms: I share the authors’ intuition that human priors can lead to significant gains in sample efficiency. I suspect that, at least for the near future, many of the most important applications of AI will either involve hardcoded structure imposed by humans, or will involve general algorithms that leverage human priors, rather than being learned “from scratch” via e.g. RL.

Toybox: A Suite of Environments for Experimental Evaluation of Deep Reinforcement Learning (Emma Tosch et al): Toybox is a reimplementation of three Atari games (Breakout, Amidar and Space Invaders) that enables researchers to customize the games themselves in order to perform better experimental evaluations of RL agents. They demonstrate its utility using a case study for each game. For example, in Breakout we often hear that the agents learn to “tunnel” through the layer of bricks so that the ball bounces around the top of the screen, destroying many bricks. To test whether the agent has learned a robust tunneling behavior, they train an agent normally, and then at test time they remove all but one brick of a column and see if the agent quickly destroys the last brick to create a tunnel. It turns out that the agent only does this for the center column, and sometimes for the one directly to its left.
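
As a sketch of the kind of granular test this enables, the snippet below removes all but the bottom brick of a chosen column at test time and measures how long a trained agent takes to clear it. The environment and agent methods used here (remove_brick, brick_alive, agent.act) are hypothetical stand-ins, not Toybox’s actual API.

```python
def test_tunneling(make_env, agent, column, n_rows, episodes=20):
    """Remove all but the bottom brick of `column` at test time and measure
    how many steps a trained agent needs to clear that last brick."""
    steps_to_clear = []
    for _ in range(episodes):
        env = make_env()
        obs, done, t = env.reset(), False, 0
        for row in range(1, n_rows):       # leave only the bottom brick
            env.remove_brick(row, column)  # hypothetical intervention call
        while not done and env.brick_alive(0, column):
            obs, _, done, _ = env.step(agent.act(obs))
            t += 1
        # Record steps to clear the brick, or None if it was never cleared.
        steps_to_clear.append(t if not env.brick_alive(0, column) else None)
    return steps_to_clear
```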

Rohin’s opinion: I really like the idea of being able to easily test whether an agent has robustly learned a behavior or not. To some extent, all of the transfer learning environments are also doing this, such as CoinRun (AN #36) and the Retro Contest (AN #1): if the learned behavior is not robust, then the agent will not perform well in the transfer environment. But with Toybox it looks like researchers will be able to run much more granular experiments looking at specific behaviors.

Smoothing Policies and Safe Policy Gradients (Matteo Papini et al)

Deep learning

Generative Modeling with Sparse Transformers (Rewon Child et al) (summarized by Cody): I see this paper as trying to interpolate the space between convolution (fixed receptive field, number of layers needed to gain visibility of the whole sequence grows with sequence length) and attention (visibility of the entire sequence at each operation, but n^2 memory and compute scaling with sequence length, since each new element needs to query and be queried by each other element). This is done by creating chains of operations that are more efficient, and that offer visibility of the whole sequence in k > 1 steps, rather than the single step of normal attention. An example of this is one attention step that pulls in information from the last 7 elements, and then a second that pulls in information from each 7th element back in time (the “aggregation points” of the first operation).
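
A minimal sketch of those two attention patterns as boolean masks over a causal 1-D sequence (a simplification of the paper’s factorized attention, written here in plain numpy):

```python
import numpy as np

def local_mask(n, window=7):
    """Each position attends to itself and the previous window - 1 positions."""
    m = np.zeros((n, n), dtype=bool)
    for i in range(n):
        m[i, max(0, i - window + 1):i + 1] = True
    return m

def strided_mask(n, stride=7):
    """Each position attends to every stride-th earlier position, i.e. the
    'aggregation points' produced by the local step."""
    m = np.zeros((n, n), dtype=bool)
    for i in range(n):
        m[i, (i % stride):i + 1:stride] = True
    return m

# Stacking one local layer and one strided layer gives every position a
# two-hop path to its entire prefix, while each mask has O(n * stride)
# nonzero entries rather than the O(n^2) of dense causal attention.
```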

Cody’s opinion: I find this pa­per re­ally clever and po­ten­tially quite high-im­pact, since Trans­form­ers are so widely used, and this pa­per could offer a sub­stan­tial speedup with­out much the­o­ret­i­cal loss of in­for­ma­tion. I also just en­joyed hav­ing to think more about the trade-offs be­tween con­volu­tions, RNNs, and trans­form­ers, and how to get ac­cess to differ­ent points along those trade­off curves.

Introducing Translatotron: An End-to-End Speech-to-Speech Translation Model (Ye Jia et al): This post introduces Translatotron, a system that takes speech (not text!) in one language and translates it to another language. This is in contrast to most current “cascaded” systems, which typically go from speech to text, then translate to the other language, and then go back from text to speech. While Translatotron doesn’t beat current systems, it demonstrates the feasibility of this approach.

Rohin’s opinion: Machine translation used to be done in multiple stages (involving parse trees as an intermediate representation), and then it was done better using end-to-end training of a deep neural net. This looks like the beginning of the same process for speech-to-speech translation. I’m not sure how much people care about speech-to-speech translation, but if it’s an important problem, I’d expect the direct speech-to-speech systems to outperform the cascaded approach relatively soon. I’m particularly interested to see whether you can “bootstrap” by using the cascaded approach to generate training data for the end-to-end approach, and then finetune the end-to-end approach on the direct speech-to-speech data that’s available to improve performance further.
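
A hedged sketch of that bootstrapping idea, with placeholder names (cascaded_translate, end_to_end_model.fit) rather than any real API:

```python
def bootstrap_dataset(source_speech_corpus, cascaded_translate):
    """Label monolingual source-language speech with synthetic target-language
    speech produced by an existing cascaded (ASR -> MT -> TTS) system."""
    return [(src, cascaded_translate(src)) for src in source_speech_corpus]

def train_direct_model(end_to_end_model, synthetic_pairs, real_pairs):
    """Pretrain the direct speech-to-speech model on the large synthetic set,
    then finetune on the smaller set of genuine speech-to-speech pairs."""
    end_to_end_model.fit(synthetic_pairs)  # placeholder training calls
    end_to_end_model.fit(real_pairs)
    return end_to_end_model
```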

A Recipe for Training Neural Networks (Andrej Karpathy): This is a great post detailing how to train neural networks in practice when you want to do anything more complicated than training the most common architecture on the most common dataset. For all of you readers who are training neural nets, I strongly recommend this post; the reason I’m not summarizing it in depth is because a) it would be a really long summary and b) it’s not that related to AI alignment.

Meta learning

Meta-learners’ learning dynamics are unlike learners’ (Neil C. Rabinowitz) (summarized by Cody): We’ve seen evidence in prior work that meta learning models can be trained to more quickly learn tasks drawn from some task distribution, by training a model in the inner loop and optimizing against generalization error. This paper suggests that meta learning doesn’t just learn new tasks faster, but has a different ordered pattern of how it masters the task. Where a “normal” learner first learns the low-frequency modes (think SVD modes, or Fourier modes) of a simple regression task, and later the high-frequency ones, the meta learner makes progress on all the modes at the same relative rate. This meta learning behavior seems to theoretically match the way a learner would update on new information if it had the “correct” prior (i.e. the one actually used to generate the simulated tasks).
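
One way to make this kind of analysis concrete is to project a learner’s residuals onto frequency modes over the course of training; the snippet below is an illustrative measurement utility under that setup, not the paper’s exact analysis.

```python
import numpy as np

def per_mode_errors(x, y_true, predictions_over_time, n_modes=8):
    """Project the residual at each recorded training step onto sine modes of
    increasing frequency, returning one error trajectory per mode.

    predictions_over_time: (steps, n_points) array of a learner's predictions
    on the grid x, recorded during training (works for a plain learner or for
    the inner loop of a meta-learner).
    """
    # Orthonormal sine basis on the grid (assumes x is roughly uniform on [0, 1]).
    basis = np.stack([np.sin(2 * np.pi * (k + 1) * x) for k in range(n_modes)],
                     axis=1)
    basis /= np.linalg.norm(basis, axis=0, keepdims=True)
    residuals = predictions_over_time - y_true  # broadcasts over training steps
    return np.abs(residuals @ basis)            # shape: (steps, n_modes)
```

Comparing these per-mode curves for a learner trained from scratch against one that starts from a meta-learned initialization is the style of comparison the paper makes.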

Cody’s opinion: Over­all I like this pa­per’s sim­plic­ity and fo­cus on un­der­stand­ing how meta learn­ing sys­tems work. I did find the re­in­force­ment learn­ing ex­per­i­ment a bit more difficult to parse and con­nect to the lin­ear and non­lin­ear re­gres­sion ex­per­i­ments, and, of course, there’s always the ques­tion with work on sim­pler prob­lems like this of whether the in­tu­ition ex­tends to more com­plex ones

Read more: Cody’s longer summary

Hierarchical RL

Multitask Soft Option Learning (Maximilian Igl et al) (summarized by Cody): This paper is a mix of variational inference and hierarchical reinforcement learning, in the context of learning skills that can be reused across tasks. Instead of learning a fixed set of options (read: skills/subpolicies) and a master task-specific policy to switch between them, this method learns cross-task priors for each skill, and then learns a task-specific posterior using the reward signal from the task, regularized towards the prior. The hope is that this will allow for an intermediate point between cross-task transfer and single-task specificity.
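
A minimal sketch of that regularization idea, using diagonal Gaussians over per-skill parameters; the paper actually frames the objective as KL control over trajectory distributions, so this simplified parameter-space version is only illustrative.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def soft_option_objective(task_return, posteriors, priors, beta=0.1):
    """Per-task objective: maximize return while keeping each skill's
    task-specific posterior close to the shared cross-task prior.
    posteriors and priors are lists of (mu, logvar) pairs, one per skill."""
    kl_penalty = sum(gaussian_kl(mq, lq, mp, lp)
                     for (mq, lq), (mp, lp) in zip(posteriors, priors))
    return task_return - beta * kl_penalty
```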

Cody’s opinion: I found this pa­per in­ter­est­ing, but also found it a bit tricky/​un­in­tu­itive to read, since it used a differ­ent RL frame than I’m used to (the idea of min­i­miz­ing the KL di­ver­gence be­tween your tra­jec­tory dis­tri­bu­tion and the op­ti­mal tra­jec­tory dis­tri­bu­tion). Over­all, seems like a rea­son­able method, but is a bit hard to in­tu­itively tell how strong the the­o­ret­i­cal ad­van­tages are on these rel­a­tively sim­ple tasks.