Alignment Newsletter #13: 07/02/18

Highlights

OpenAI Five (Many people at OpenAI): OpenAI has trained a team of five neural networks to play a particular set of Dota heroes in a mirror match (playing against the same set of heroes) with a few restrictions, and has started to beat amateur human players. They are aiming to beat a team of top professionals at The International in August, with the same set of five heroes, but without any other restrictions. Salient points:

  • The method is re­mark­ably sim­ple—it’s a scaled up ver­sion of PPO with train­ing data com­ing from self-play, with re­ward shap­ing and some heuris­tics for ex­plo­ra­tion, where each agent is im­ple­mented by an LSTM.

  • There’s no hu­man data apart from the re­ward shap­ing and ex­plo­ra­tion heuris­tics.

  • Contrary to most expectations, they didn’t need anything fundamentally new in order to get long-term strategic planning. I was particularly surprised by this. OpenAI researchers share some interesting thoughts in this thread. In particular, assuming good exploration, the variance of the gradient should scale linearly with the duration of an episode, and so you might expect to need only linearly more samples to counteract this.

  • They used 256 ded­i­cated GPUs and 128,000 pre­emptible CPUs. A Hacker News com­ment es­ti­mates the cost at $2500 per hour, which would put the likely to­tal cost in the mil­lions of dol­lars.

  • They simu­late 900 years of Dota ev­ery day, which is a ra­tio of ~330,000:1, sug­gest­ing that each CPU is run­ning Dota ~2.6x faster than real time. In re­al­ity, it’s prob­a­bly run­ning many times faster than that, but pre­emp­tions, com­mu­ni­ca­tion costs, syn­chro­niza­tion etc. all lead to in­effi­ciency.

  • There was no ex­plicit com­mu­ni­ca­tion mechanism be­tween agents, but they all get to ob­serve the full Dota 2 state (not pix­els) that any of the agents could ob­serve, so com­mu­ni­ca­tion is not re­ally nec­es­sary.

  • A ver­sion of the code with a se­ri­ous bug was still able to train to beat hu­mans. Not en­courag­ing for safety.

  • Alex Ir­pan cov­ers some of these points in more depth in Quick Opinions on OpenAI Five.

  • Gw­ern com­ments as well.
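
The throughput and cost figures in the bullets above can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: it assumes a 365-day year and one Dota instance per CPU, and it takes the Hacker News estimate of $2500 per hour at face value. The variable names are mine, not OpenAI's.

```python
# Back-of-the-envelope check of the OpenAI Five numbers quoted above.
YEARS_SIMULATED_PER_DAY = 900   # from the summary
PREEMPTIBLE_CPUS = 128_000      # from the summary
HOURLY_COST_USD = 2500          # Hacker News estimate quoted in the summary
DAYS_PER_YEAR = 365             # assumption: ignore leap years

# Game-days simulated per wall-clock day, across the whole cluster.
aggregate_ratio = YEARS_SIMULATED_PER_DAY * DAYS_PER_YEAR

# Average speedup per CPU, assuming one game instance per CPU.
per_cpu_speedup = aggregate_ratio / PREEMPTIBLE_CPUS

# How many days of training it takes to spend one million dollars.
daily_cost_usd = HOURLY_COST_USD * 24
days_per_million_usd = 1_000_000 / daily_cost_usd

print(f"aggregate ratio: ~{aggregate_ratio:,}:1")
print(f"per-CPU speedup: ~{per_cpu_speedup:.1f}x real time")
print(f"days of training per $1M: ~{days_per_million_usd:.1f}")
```

Under these assumptions, each million dollars buys a bit over two weeks of training, so a total in the millions corresponds to a training run on the order of a month or more.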

My opinion: I might be more ex­cited by an ap­proach that was able to learn from hu­man games (which are plen­tiful), and per­haps fine­tune with RL, in or­der to de­velop an ap­proach that could gen­er­al­ize to more tasks in the fu­ture, where hu­man data is available but a simu­la­tor is not. (Given the ridicu­lous sam­ple com­plex­ity, pure RL with PPO can only be used in tasks with a simu­la­tor.) On the other hand, an ap­proach that lev­er­aged hu­man data would nec­es­sar­ily be at least some­what spe­cific to Dota. A de­pen­dence on hu­man data is un­likely to get us to gen­eral in­tel­li­gence, whereas this re­sult sug­gests that we can solve tasks that have a simu­la­tor, ex­plo­ra­tion strat­egy, and a dense re­ward func­tion, which re­ally is push­ing the bound­ary on gen­er­al­ity. This seems to be gdb’s take: “We are very en­couraged by the al­gorith­mic im­pli­ca­tion of this re­sult — in fact, it mir­rors closely the story of deep learn­ing (ex­ist­ing al­gorithms at large scale solve oth­er­wise un­solv­able prob­lems). If you have a very hard prob­lem for which you have a simu­la­tor, our re­sults im­ply there is a real, prac­ti­cal path to­wards solv­ing it. This still needs to be proven out in real-world do­mains, but it will be very in­ter­est­ing to see the full ram­ifi­ca­tions of this find­ing.”

Paul’s re­search agenda FAQ (zhu­keepa): Ex­actly what it sounds like. I’m not go­ing to sum­ma­rize it be­cause it’s long and cov­ers a lot of stuff, but I do recom­mend it.

Tech­ni­cal AI alignment

Tech­ni­cal agen­das and prioritization

Conceptual issues in AI safety: the paradigmatic gap (Jon Gauthier): Lots of current work on AI safety focuses on what we can call “mid-term safety”: the safety of AI systems that are more powerful and more broadly deployed than the ones we have today, but that work using techniques relatively similar to the ones we use today. However, it seems plausible that there will be a paradigm shift in how we build AI systems, and if so it’s likely that we will have a new, completely different set of mid-term concerns, rendering the previous mid-term work useless. For example, at the end of the 19th century, horse excrement was a huge public health hazard, and “mid-term safety” would likely have been about how to remove the excrement. Instead, the automobile was developed and started replacing horses, leading to a new set of mid-term concerns (e.g. pollution, traffic accidents), and any previous work on removing horse excrement became near-useless.

My opinion: I fo­cus al­most ex­clu­sively on mid-term safety (while think­ing about long-term safety), not be­cause I dis­agree with this ar­gu­ment, but in spite of it. I think there is a good chance that any work I do will be use­less for al­ign­ing su­per­in­tel­li­gent AI be­cause of a paradigm shift, but I do it any­way be­cause it seems very im­por­tant on short timelines, which are eas­ier to af­fect; and I don’t know of other ap­proaches to take that would have a sig­nifi­cantly higher prob­a­bil­ity of be­ing use­ful for al­ign­ing su­per­in­tel­li­gent AI.

Read more: A pos­si­ble stance for AI con­trol research

Optimization Amplifies (Scott Garrabrant): One model of the difference between mathematicians and scientists is that a scientist is good at distinguishing between 0.01%, 50% and 99.99%, whereas a mathematician is good at distinguishing between 99.99% and 100%. Certainly it seems like if we can get 99.99% confidence that an AI system is aligned, we should count that as a huge win, and not hope for more (since the remaining 0.01% is extremely hard to get), so why do we need mathematicians? Scott argues that optimization is particularly special, in that the point of very strong optimization is to hit a very narrow target, which severely affects extreme probabilities, moving them from 0.01% to near-100%. For example, if you draw a million samples from a normal distribution and optimize for the largest one, it is almost certain to be at least 4 standard deviations above the mean (which is incredibly unlikely for a randomly chosen sample). In this sort of setting, the deep understanding of a problem that you get from a mathematician is still important. Note that Scott is not saying that we don’t need scientists, nor that we should aim for 100% certainty that an AI is aligned.
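
To make the normal-distribution example concrete, here is a small sketch that computes both probabilities in closed form rather than by sampling. It assumes the draws are independent standard normals; the function names are my own.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal random variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def prob_max_exceeds(z, n):
    """P(max of n independent standard normal draws > z).

    Equals 1 - (1 - p)^n with p = P(Z > z); expm1/log1p keep this
    numerically stable when p is tiny.
    """
    p = upper_tail(z)
    return -math.expm1(n * math.log1p(-p))

# A single random sample is very unlikely to be 4 standard deviations out...
single = upper_tail(4.0)
# ...but the best of a million samples almost certainly is.
best_of_million = prob_max_exceeds(4.0, 1_000_000)

print(f"P(one sample > 4 sigma)          = {single:.2e}")
print(f"P(max of 1e6 samples > 4 sigma)  = {best_of_million:.12f}")
```

The single-sample probability is on the order of a few in 100,000, while the probability for the maximum is indistinguishable from 1 at ordinary precision, which is the amplification the post is pointing at.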

My opinion: I think I agree with this post? Cer­tainly for a su­per­in­tel­li­gence that is vastly smarter than hu­mans, I buy this ar­gu­ment (and in gen­eral am not op­ti­mistic about solv­ing al­ign­ment). How­ever, hu­mans seem to be fairly good at keep­ing each other in check, with­out a deep un­der­stand­ing of what makes hu­mans tick, even though hu­mans of­ten do op­ti­mize against each other. Per­haps we can main­tain this situ­a­tion in­duc­tively as our AI sys­tems get more pow­er­ful, with­out re­quiring a deep un­der­stand­ing of what’s go­ing on? Over­all I’m pretty con­fused on this point.

Another take on agent foun­da­tions: for­mal­iz­ing zero-shot rea­son­ing (zhu­keepa): There are strong in­cen­tives to build a re­cur­sively self-im­prov­ing AI, and in or­der to do this with­out value drift, the AI needs to be able to rea­son effec­tively about the na­ture of changes it makes to it­self. In such sce­nar­ios, it is in­suffi­cient to “rea­son with ex­treme cau­tion”, where you think re­ally hard about the pro­posed change, and im­ple­ment it if you can’t find rea­sons not to do it. In­stead, you need to do some­thing like “zero-shot rea­son­ing”, where you prove un­der some rea­son­able as­sump­tions that the pro­posed change is good. This sort of rea­son­ing must be very pow­er­ful, en­abling the AI to eg. build a space­craft that lands on Mars, af­ter ob­serv­ing Earth for one day. This mo­ti­vates many of the prob­lems in MIRI’s agenda, such as Vingean re­flec­tion (self-trust), log­i­cal un­cer­tainty (how to han­dle be­ing a bounded rea­soner), coun­ter­fac­tu­als, etc., which all help to for­mal­ize zero-shot rea­son­ing.

My opinion: This as­sumes an on­tol­ogy where there ex­ists a util­ity func­tion that an AI is op­ti­miz­ing, and changes to the AI seem es­pe­cially likely to change the util­ity func­tion in a ran­dom di­rec­tion. In such a sce­nario, yes, you prob­a­bly should be wor­ried. How­ever, in prac­tice, I ex­pect that pow­er­ful AI sys­tems will not look like they are ex­plic­itly max­i­miz­ing some util­ity func­tion. If you change some com­po­nent of the sys­tem for the worse, you are likely to de­grade its perfor­mance, but not likely to dras­ti­cally change its be­hav­ior to cause hu­man ex­tinc­tion. For ex­am­ple, even in RL (which is the clos­est thing to ex­pected util­ity max­i­miza­tion), you can have se­ri­ous bugs and still do rel­a­tively well on the ob­jec­tive. A pub­lic ex­am­ple of this is in OpenAI Five (https://​blog.ope­​ope­nai-five/​), but I also hear this ex­pressed when talk­ing to RL re­searchers (and see this my­self). While you still want to be very care­ful with self-mod­ifi­ca­tion, it seems gen­er­ally fine not to have a for­mal proof be­fore mak­ing the change, and eval­u­at­ing the change af­ter it has taken place. (This would fail dra­mat­i­cally if the change dras­ti­cally changed be­hav­ior, but if it only de­grades perfor­mance, I ex­pect the AI would still be com­pe­tent enough to no­tice and undo the change.) It may be the case that ad­ver­sar­ial sub­pro­cesses could take ad­van­tage of these sorts of bugs, but I ex­pect that we need ad­ver­sar­ial-sub­pro­cess-spe­cific re­search to ad­dress this, not zero-shot rea­son­ing.

The Learn­ing-The­o­retic AI Align­ment Re­search Agenda (Vadim Kosoy): This agenda aims to cre­ate a gen­eral ab­stract the­ory of in­tel­li­gence (in a man­ner similar to AIXI, but with some defi­cien­cies re­moved). In par­tic­u­lar, once we use the frame­work of re­in­force­ment learn­ing, re­gret bounds are a par­tic­u­lar way of prov­ably quan­tify­ing an agent’s in­tel­li­gence (though there may be other ways as well). Once we have this the­ory, we can ground all other AI al­ign­ment prob­lems within it. Speci­fi­cally, al­ign­ment would be for­mal­ized as a value learn­ing pro­to­col that achieves some re­gret bound. With this for­mal­iza­tion, we can solve hard metaphilos­o­phy prob­lems such as “What is im­perfect ra­tio­nal­ity?” through the in­tu­itions gained from look­ing at the prob­lem through the lens of value learn­ing pro­to­cols and uni­ver­sal re­in­force­ment learn­ing.

My opinion: This agenda, like oth­ers, is mo­ti­vated by the sce­nario where we need to get al­ign­ment right the first time, with­out em­piri­cal feed­back loops, both be­cause we might be fac­ing one-shot suc­cess or failure, and be­cause the stakes are so high that we should aim for high re­li­a­bil­ity sub­ject to time con­straints. I put low prob­a­bil­ity on the first rea­son (al­ign­ment be­ing one-shot), and it seems much less tractable, so I mostly ig­nore those sce­nar­ios. I agree with the sec­ond rea­son, but aiming for this level of rigor seems like it will take much longer than the time we ac­tu­ally have. Given this high level dis­agree­ment, it’s hard for me to eval­u­ate the re­search agenda it­self.

Iter­ated dis­til­la­tion and amplification

Paul’s re­search agenda FAQ (zhu­keepa): Sum­ma­rized in the high­lights!

Agent foundations

Fore­cast­ing us­ing in­com­plete mod­els (Vadim Kosoy)

Log­i­cal un­cer­tainty and Math­e­mat­i­cal un­cer­tainty (Alex Men­nen)

Learn­ing hu­man intent

Policy Ap­proval (Abram Dem­ski): Ar­gues that even if we had the true hu­man util­ity func­tion (as­sum­ing it ex­ists), an AI that op­ti­mizes it would still not be al­igned. It also sketches out an idea for learn­ing poli­cies in­stead of util­ity func­tions that gets around these is­sues.

My opinion: I dis­agree with the post but most likely I don’t un­der­stand it. My straw­man of the post is that it is ar­gu­ing for imi­ta­tion learn­ing in­stead of in­verse re­in­force­ment learn­ing (which differ when the AI and hu­man know differ­ent things), which seems wrong to me.

Hu­man-In­ter­ac­tive Sub­goal Su­per­vi­sion for Effi­cient In­verse Re­in­force­ment Learn­ing (Xin­lei Pan et al)

Multi-agent In­verse Re­in­force­ment Learn­ing for Gen­eral-sum Stochas­tic Games (Xiaomin Lin et al)

Ad­ver­sar­ial Ex­plo­ra­tion Strat­egy for Self-Su­per­vised Imi­ta­tion Learn­ing (Zhang-Wei Hong et al)

Prevent­ing bad behavior

Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes (Shun Zhang et al): As we saw in Alignment Newsletter #11, one approach to avoiding side effects is to create a whitelist of effects that are allowed. In this paper, the agent learns both a whitelist of allowed effects and a blacklist of disallowed effects. They assume that the MDP in which the agent is acting has been factored into a set of features that can take on different values, and then classify each feature as locked (unchangeable), free (changeable), or unknown. If there are no unknown features, then we can calculate the optimal policy using variants of standard techniques (for example, by changing the transition function to remove transitions that would change locked features, and then running any off-the-shelf MDP solver). However, this would require the operator to label all features as locked or free, which would be very tedious. To solve this, they allow the agent to query the operator about whether a certain feature is locked or free, and provide algorithms that reduce the number of queries that the agent needs to make in order to find an optimal safe policy.
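
As an illustration of the "no unknown features" case described above, here is a minimal sketch that removes transitions changing a locked feature and then solves the resulting MDP. Everything in it is hypothetical and not taken from the paper: the toy corridor, the feature names `pos` and `vase`, the `step` function, and the use of plain value iteration as a stand-in for an off-the-shelf MDP solver.

```python
from itertools import product

FEATURES = ("pos", "vase")  # hypothetical factored state: (pos, vase)
GOAL = 3                    # the episode ends when pos == GOAL

def step(state, action):
    """Deterministic toy dynamics; returns the next state, or None if unavailable."""
    pos, vase = state
    if pos == GOAL:
        return None
    if action == "walk":
        return (pos + 1, vase)
    if action == "shortcut" and pos == 0:
        return (GOAL, "broken")  # faster, but has a side effect on the vase
    return None

def allowed_transitions(locked):
    """Transition table with every transition that changes a locked feature removed."""
    table = {}
    for state in product(range(GOAL + 1), ("intact", "broken")):
        for action in ("walk", "shortcut"):
            nxt = step(state, action)
            if nxt is None:
                continue
            changed = {f for f, a, b in zip(FEATURES, state, nxt) if a != b}
            if changed & set(locked):
                continue  # this transition would alter a locked feature
            table.setdefault(state, {})[action] = nxt
    return table

def solve(table, sweeps=10):
    """Value iteration with reward -1 per step; goal states have value 0."""
    values = {}
    for _ in range(sweeps):
        for state, actions in table.items():
            values[state] = max(-1 + values.get(n, 0.0) for n in actions.values())
    policy = {
        s: max(acts, key=lambda a: -1 + values.get(acts[a], 0.0))
        for s, acts in table.items()
    }
    return values, policy

start = (0, "intact")
free_values, free_policy = solve(allowed_transitions(locked=()))
safe_values, safe_policy = solve(allowed_transitions(locked=("vase",)))
print("vase free:  ", free_policy[start], free_values[start])
print("vase locked:", safe_policy[start], safe_values[start])
```

With the vase free, the agent takes the shortcut and breaks it; with the vase locked, that transition no longer exists and the agent walks instead. The paper's contribution is the querying step on top of this, which the sketch does not attempt.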

My opinion: This seems like a good first step to­wards whitelist­ing—there’s still a lot of hard­coded knowl­edge from a hu­man (which fea­tures to pay at­ten­tion to, the tran­si­tion func­tion) and re­stric­tions (the num­ber of rele­vant fea­tures needs to be small), but it takes a prob­lem and pro­vides a solu­tion that works in that set­ting. In the re­cent whitelist­ing ap­proach, I was wor­ried that the whitelist sim­ply wouldn’t in­clude enough tran­si­tions for the agent to be able to do any­thing use­ful. Since this ap­proach ac­tively queries the op­er­a­tor un­til it finds a safe policy, that is no longer an is­sue. How­ever, the cor­re­spond­ing worry would be that it takes pro­hibitively many queries be­fore the agent can do any­thing use­ful. (Their em­piri­cal eval­u­a­tion is on toy grid­wor­lds, so this prob­lem did not come up.) Another worry pre­vi­ously was that whitelist­ing causes an agent to be “clingy”, that is, it wants to pre­vent all changes to non-whitelisted fea­tures, even if they are caused by phys­i­cal laws, or other hu­mans. A similar prob­lem could arise here when this is gen­er­al­ized to dy­namic and/​or mul­ti­a­gent en­vi­ron­ments.

Read more: Wor­ry­ing about the Vase: Whitelisting

Han­dling groups of agents

Learn­ing So­cial Con­ven­tions in Markov Games (Adam Lerer and Alexan­der Peysakhovich)


Interpretability

Open the Black Box Data-Driven Explanation of Black Box Decision Systems (Dino Pedreschi et al)

In­ter­pretable Dis­cov­ery in Large Image Data Sets (Kiri L. Wagstaff et al)

Near-term concerns

Ad­ver­sar­ial examples

On Ad­ver­sar­ial Ex­am­ples for Char­ac­ter-Level Neu­ral Ma­chine Trans­la­tion (Javid Ebrahimi et al)

AI capabilities

Re­in­force­ment learning

OpenAI Five (Many peo­ple at OpenAI): Sum­ma­rized in the high­lights!

Retro Con­test: Re­sults (John Schul­man et al): OpenAI has an­nounced the re­sults of the Retro Con­test. The win­ning sub­mis­sions were mod­ified ver­sions of ex­ist­ing al­gorithms like joint PPO and Rain­bow, with­out any Sonic-spe­cific parts.

A Tour of Re­in­force­ment Learn­ing: The View from Con­tin­u­ous Con­trol (Ben­jamin Recht)

Evolv­ing sim­ple pro­grams for play­ing Atari games (Den­nis G Wil­son et al)

Ac­cu­racy-based Cur­ricu­lum Learn­ing in Deep Re­in­force­ment Learn­ing (Pierre Fournier et al)

Deep learning

DARTS: Differ­en­tiable Ar­chi­tec­ture Search (Hanx­iao Liu et al)

Re­source-Effi­cient Neu­ral Ar­chi­tect (Yanqi Zhou et al)

AGI theory

The Foun­da­tions of Deep Learn­ing with a Path Towards Gen­eral In­tel­li­gence (Eray Özku­ral)


RAISE sta­tus re­port April-June 2018 (Veerle)