Alignment Newsletter #18

Highlights


Learn­ing Dex­ter­ity (Many peo­ple at OpenAI): Most cur­rent ex­per­i­ments with robotics work on rel­a­tively small state spaces (think 7 de­grees of free­dom, each a real num­ber) and are trained in simu­la­tion. If we could throw a lot of com­pute at the prob­lem, could we do sig­nifi­cantly bet­ter? Yes! Us­ing the same gen­eral ap­proach as with OpenAI Five, OpenAI has built a sys­tem called Dactyl, which al­lows a phys­i­cal real-world dex­ter­ous hand to ma­nipu­late a block. It may not seem as im­pres­sive as the videos of hu­manoids run­ning through ob­sta­cle courses, but this is way harder than your typ­i­cal Mu­joco en­vi­ron­ment, es­pe­cially since they aim to get it work­ing on a real robot. As with OpenAI Five, they only need a re­ward func­tion (I be­lieve not even a shaped re­ward func­tion in this case), a simu­la­tor, and a good way to ex­plore. In this set­ting though, “ex­plo­ra­tion” is ac­tu­ally do­main ran­dom­iza­tion, where you ran­domly set pa­ram­e­ters that you are un­cer­tain about (such as the co­effi­cient of fric­tion be­tween two sur­faces), so that the learned policy is ro­bust to dis­tri­bu­tion shift from the simu­la­tor to the real world. (OpenAI Five also used do­main ran­dom­iza­tion, but in that case it was not be­cause we were un­cer­tain about the pa­ram­e­ters in the simu­la­tor, but be­cause the policy was too spe­cial­ized to the kinds of char­ac­ters and heroes it was see­ing, and ran­dom­iz­ing those prop­er­ties ex­posed it to a wider va­ri­ety of sce­nar­ios so it had to learn more gen­eral poli­cies.) They use 6144 CPU cores and 8 GPUs, which is much less than for OpenAI Five, but much more than for a typ­i­cal Mu­joco en­vi­ron­ment.
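To make the domain randomization idea concrete, here is a minimal sketch (not Dactyl's actual code): at the start of every episode, the simulator parameters we are uncertain about get resampled, so the policy never sees one fixed simulator. The parameter names and ranges are hypothetical, chosen only for illustration, and the policy rollout itself is omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    # Hypothetical physical parameters we are uncertain about.
    friction: float
    object_mass_kg: float
    action_delay_ms: float


# Hypothetical ranges; a real setup would pick these around calibrated values.
RANGES = {
    "friction": (0.5, 1.5),
    "object_mass_kg": (0.03, 0.09),
    "action_delay_ms": (0.0, 40.0),
}


def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw each uncertain parameter uniformly from its range."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


def run_training(num_episodes: int, rng: random.Random) -> list:
    """Resample the simulator parameters once per episode."""
    params_seen = []
    for _ in range(num_episodes):
        params = sample_sim_params(rng)
        # A real loop would reset the simulator with `params`, roll out the
        # policy, and run an RL update here. That part is left out.
        params_seen.append(params)
    return params_seen
```

Because every episode uses different physics, a policy that does well on average has to work across the whole range, which is the hope for why it would also work on the real robot.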

They do sep­a­rate the prob­lem into two pieces—first, they learn how to map from cam­era pic­tures to a 3D pose (us­ing con­volu­tional nets), and sec­ond, they use RL to choose ac­tions based on the 3D pose. They can also get bet­ter es­ti­mates of the 3D pose us­ing mo­tion track­ing. They find that the CNN is al­most as good as mo­tion track­ing, and that the do­main ran­dom­iza­tion is cru­cial for get­ting the sys­tem to ac­tu­ally work.

They also have a cou­ple of sec­tions on sur­pris­ing re­sults and things that didn’t work. Prob­a­bly the most in­ter­est­ing part was that they didn’t need to use the tac­tile sen­sors to get these re­sults. They couldn’t get these sen­sors in simu­la­tion, so they just did with­out and it seems to have worked fine. It also turns out that the robot’s re­ac­tion time wasn’t too im­por­tant—there wasn’t a big differ­ence in chang­ing from 80ms re­ac­tion time to 40ms re­ac­tion time; in fact, this just in­creased the re­quired train­ing time with­out much benefit.

Prob­a­bly the most in­ter­est­ing part of the post is the last para­graph (ital­ics in­di­cates my notes): “This pro­ject com­pletes a full cy­cle of AI de­vel­op­ment that OpenAI has been pur­su­ing for the past two years: we’ve de­vel­oped a new learn­ing al­gorithm (PPO), scaled it mas­sively to solve hard simu­lated tasks (OpenAI Five), and then ap­plied the re­sult­ing sys­tem to the real world (this post). Re­peat­ing this cy­cle at in­creas­ing scale is the pri­mary route we are pur­su­ing to in­crease the ca­pa­bil­ities of to­day’s AI sys­tems to­wards safe ar­tifi­cial gen­eral in­tel­li­gence.”

My opinion: This is pretty ex­cit­ing—trans­fer­ring a policy from simu­la­tion to the real world is no­to­ri­ously hard, but it turns out that as long as you use do­main ran­dom­iza­tion (and 30x the com­pute) it ac­tu­ally is pos­si­ble to trans­fer the policy. I wish they had com­pared the suc­cess prob­a­bil­ity in simu­la­tion to the suc­cess prob­a­bil­ity in the real world—right now I don’t know how well the policy trans­ferred. (That is, I want to eval­u­ate how well do­main ran­dom­iza­tion solved the dis­tri­bu­tion shift prob­lem.) Lots of other ex­cit­ing things too, but they are pretty similar to the ex­cit­ing things about OpenAI Five, such as the abil­ity to learn higher level strate­gies like finger pivot­ing and slid­ing (analo­gously, fight­ing over mid or 5-man push).

Variational Option Discovery Algorithms (Joshua Achiam et al): We can hope to do hierarchical reinforcement learning by first discovering several useful simple policies (or “options”) by just acting in the environment without any reward function, and then using these options as primitive actions in a higher level policy that learns to do some task (using a reward function). How could we learn the options without a reward function though? Intuitively, we would like to learn behaviors that are different from each other. One way to frame this is as an encoder-decoder problem. Suppose we want to learn K options. Then, we can give the encoder a number in the range [1, K], have it “encode” the number into a trajectory τ (that is, our encoder is a policy), and then have a decoder take τ and recover the original number. We train the encoder/policy and decoder jointly, optimizing them to successfully recover the original number (called a context). Intuitively, the encoder/policy wants to have very different behaviors for each option, so that it is easy for the decoder to figure out the context from the trajectory τ. However, a simple solution would be for the encoder/policy to just take a particular series of actions for each context and then stop, while the decoder learns an exact mapping from final states to contexts. To avoid this, we can decrease the capacity of the decoder (i.e. don’t give it too many layers), and we also optimize for the entropy of the encoder/policy, which encourages the encoder/policy to be more stochastic, and so it is more likely to learn overall behaviors that can still have some stochasticity, while still allowing the decoder to decode them.
It turns out that this op­ti­miza­tion prob­lem has a one-to-one cor­re­spon­dence with vari­a­tional au­toen­coders, mo­ti­vat­ing the name “vari­a­tional op­tion dis­cov­ery”. To sta­bi­lize train­ing, they start with a small K, and in­crease K when­ever the de­coder be­comes pow­er­ful enough. They eval­u­ate in Gym en­vi­ron­ments, a simu­lated robotic hand, and a new “Tod­dler” en­vi­ron­ment. They find that the scheme works well (in terms of max­i­miz­ing the ob­jec­tive) in all en­vi­ron­ments, but that the learned be­hav­iors no longer look nat­u­ral in the Tod­dler en­vi­ron­ment (which is the most com­plex). They also show that the learned poli­cies can be used for hi­er­ar­chi­cal RL in the An­tMaze prob­lem.
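As a rough sketch of the objective described above, in notation that I am assuming rather than quoting: G is the distribution over contexts, π is the encoder/policy, P_D is the decoder's predicted distribution over contexts, H is the policy entropy, and β is a trade-off coefficient.

```latex
\max_{\pi, D} \; \mathbb{E}_{c \sim G} \Big[ \mathbb{E}_{\tau \sim \pi(\cdot \mid c)} \big[ \log P_D(c \mid \tau) \big] + \beta \, \mathcal{H}(\pi \mid c) \Big]
```

The first term rewards trajectories from which the decoder can recover the context. The second is the entropy bonus that discourages collapsing each context into one fixed action sequence.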

This is very similar to the re­cent Diver­sity Is All You Need. DIAYN aims to de­code the con­text from ev­ery state along a tra­jec­tory, which in­cen­tivizes it to find be­hav­iors of the form “go to a goal state”, whereas VALOR (this work) de­codes the con­text from the en­tire tra­jec­tory (with­out ac­tions, which would make the de­coder’s job too easy), which al­lows it to learn be­hav­iors with mo­tion, such as “go around in a cir­cle”.

My opinion: It’s re­ally re­fresh­ing to read a pa­per with a nega­tive re­sult about their own method (speci­fi­cally, that the learned be­hav­iors on Tod­dler do not look nat­u­ral). It makes me trust the rest of their pa­per so much more. (A very game­able in­stinct, I know.) While they were able to find a fairly di­verse set of op­tions, and could in­ter­po­late be­tween them, their ex­per­i­ments found that us­ing this for hi­er­ar­chi­cal RL was about as good as train­ing hi­er­ar­chi­cal RL from scratch. I guess I’m just say­ing things they’ve already said—I think they’ve done such a great job writ­ing this pa­per that they’ve already told me what my opinion about the topic should be, so there’s not much left for me to say.

Tech­ni­cal AI alignment


A Gym Grid­world En­vi­ron­ment for the Treach­er­ous Turn (Michaël Trazzi): An ex­am­ple Gym en­vi­ron­ment in which the agent starts out “weak” (hav­ing an in­ac­cu­rate bow) and later be­comes “strong” (get­ting a bow with perfect ac­cu­racy), af­ter which the agent un­der­takes a treach­er­ous turn in or­der to kill the su­per­vi­sor and wire­head.

My opinion: I’m a fan of ex­e­cutable code that demon­strates the prob­lems that we are wor­ry­ing about—it makes the con­cept (in this case, a treach­er­ous turn) more con­crete. In or­der to make it more re­al­is­tic, I would want the agent to grow in ca­pa­bil­ity or­gan­i­cally (rather than sim­ply get­ting a more pow­er­ful weapon). It would re­ally drive home the point if the agent un­der­took a treach­er­ous turn the very first time, whereas in this post I as­sume it learned us­ing many epi­sodes of trial-and-er­ror that a treach­er­ous turn leads to higher re­ward. This seems hard to demon­strate with to­day’s ML in any com­plex en­vi­ron­ment, where you need to learn from ex­pe­rience in­stead of us­ing eg. value iter­a­tion, but it’s not out of the ques­tion in a con­tinual learn­ing setup where the agent can learn a model of the world.

Agent foundations

Coun­ter­fac­tu­als, thick and thin (Nisan): There are many differ­ent ways to for­mal­ize coun­ter­fac­tu­als (the post sug­gests three such ways). Often, for any given way of for­mal­iz­ing coun­ter­fac­tu­als, there are many ways you could take a coun­ter­fac­tual, which give differ­ent an­swers. When con­sid­er­ing the phys­i­cal world, we have strong causal mod­els that can tell us which one is the “cor­rect” coun­ter­fac­tual. How­ever, there is no such method for log­i­cal coun­ter­fac­tu­als yet.

My opinion: I don’t think I un­der­stood this post, so I’ll ab­stain on an opinion.

De­ci­sions are not about chang­ing the world, they are about learn­ing what world you live in (shminux): The post tries to rec­on­cile de­ci­sion the­ory (in which agents can “choose” ac­tions) with the de­ter­minis­tic phys­i­cal world (in which noth­ing can be “cho­sen”), us­ing many ex­am­ples from de­ci­sion the­ory.

Han­dling groups of agents

Multi-Agent Gen­er­a­tive Ad­ver­sar­ial Imi­ta­tion Learn­ing (Ji­am­ing Song et al): This pa­per gen­er­al­izes GAIL (which was cov­ered last week) to the mul­ti­a­gent set­ting, where we want to imi­tate a group of in­ter­act­ing agents. They want to find a Nash equil­ibrium in par­tic­u­lar. They for­mal­ize the Nash equil­ibrium con­straints and use this to mo­ti­vate a par­tic­u­lar op­ti­miza­tion prob­lem for mul­ti­a­gent IRL, that looks very similar to their op­ti­miza­tion prob­lem for reg­u­lar IRL in GAIL. After that, it is quite similar to GAIL—they use a reg­u­larizer ψ for the re­ward func­tions, show that the com­po­si­tion of mul­ti­a­gent RL and mul­ti­a­gent IRL can be solved as a sin­gle op­ti­miza­tion prob­lem in­volv­ing the con­vex con­ju­gate of ψ, and pro­pose a par­tic­u­lar in­stan­ti­a­tion of ψ that is data-de­pen­dent, giv­ing an al­gorithm. They do have to as­sume in the the­ory that the mul­ti­a­gent RL prob­lem has a unique solu­tion, which is not typ­i­cally true, but may not be too im­por­tant. As be­fore, to make the al­gorithm prac­ti­cal, they struc­ture it like a GAN, with dis­crim­i­na­tors act­ing like re­ward func­tions. What if we have prior in­for­ma­tion that the game is co­op­er­a­tive or com­pet­i­tive? In this case, they pro­pose chang­ing the reg­u­larizer ψ, mak­ing it keep all the re­ward func­tions the same (if co­op­er­a­tive), mak­ing them nega­tions of each other (in two-player zero-sum games), or leav­ing it as is. They eval­u­ate in a va­ri­ety of sim­ple mul­ti­a­gent games, as well as a plank en­vi­ron­ment in which the en­vi­ron­ment changes be­tween train­ing and test time, thus re­quiring the agent to learn a ro­bust policy, and find that the cor­rect var­i­ant of MAGAIL (co­op­er­a­tive/​com­pet­i­tive/​nei­ther) out­performs both be­hav­ioral clon­ing and sin­gle-agent GAIL (which they run N times to in­fer a sep­a­rate re­ward for each agent).

My opinion: Multiagent settings seem very important (since there does happen to be more than one human in the world). This looks like a useful generalization from the single agent case to the multiagent case, though it’s not clear to me that this deals with the major challenges that come from multiagent scenarios. One major challenge is that there is no longer a single optimal equilibrium when there are multiple agents, but they simply assume in their theoretical analysis that there is only one solution. Another one is that it seems more important that the policies take history into account somehow, but they don’t do this. (If you don’t take history into account, then you can’t learn strategies like tit-for-tat in the iterated prisoner’s dilemma.) But to be clear, I think this is the standard setup for multiagent RL. It seems like the field is not trying to deal with this issue yet (even though they could, using eg. a recurrent policy, I think?).
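To illustrate the point about history, here is a small sketch of the iterated prisoner's dilemma (the function names are mine, not from the paper). Tit-for-tat has to look at the opponent's previous move, so a policy that only sees a current state containing no history cannot represent it.

```python
COOPERATE, DEFECT = "C", "D"


def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's most recent move."""
    return COOPERATE if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    """A memoryless policy: ignores the history entirely."""
    return DEFECT


def play(policy_a, policy_b, rounds):
    """Play the iterated game; each policy sees only the other's past moves."""
    seen_by_a, seen_by_b = [], []
    moves = []
    for _ in range(rounds):
        a = policy_a(seen_by_a)
        b = policy_b(seen_by_b)
        moves.append((a, b))
        seen_by_a.append(b)
        seen_by_b.append(a)
    return moves
```

For instance, `play(tit_for_tat, always_defect, 3)` has tit-for-tat cooperate once and then defect, a switch in behavior that depends on what happened in the previous round.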

Mis­cel­la­neous (Align­ment)

Safely and use­fully spec­tat­ing on AIs op­ti­miz­ing over toy wor­lds (Alex Men­nen): One way to achieve safety would be to build an AI that op­ti­mizes in a vir­tual world run­ning on a com­puter, and doesn’t care about the phys­i­cal world. Even if it re­al­izes that it can break out and eg. get more com­pute, these sorts of changes to the phys­i­cal world would not be helpful for the pur­pose of op­ti­miz­ing the ab­stract com­pu­ta­tional ob­ject that is the vir­tual world. How­ever, if we take the re­sults of the AI and build them in the real world, that causes a dis­tri­bu­tional shift from the toy world to the real world that could be catas­trophic. For ex­am­ple, if the AI cre­ated an­other agent in the toy world that did rea­son­able things in the toy world, when we bring it to the real world it may re­al­ize that it can in­stead ma­nipu­late hu­mans in or­der to do things.

My opinion: It’s not obvious to me, even on the “optimizing an abstract computational process” model, why an AI would not want to get more compute: it can use this compute for itself, without changing the abstract computational process it is optimizing, and it will probably do better. It seems that if you want to get this to work, you need to have the AI want to compute the result of running itself without any modification or extra compute on the virtual world. This feels very hard to me. Separately, I also find it hard to imagine us building a virtual world that is similar enough to the real world that we are able to transfer solutions between the two, even with some finetuning in the real world.

Sand­box­ing by Phys­i­cal Si­mu­la­tion? (mori­d­i­na­mael)

Near-term concerns

Ad­ver­sar­ial examples

Eval­u­at­ing and Un­der­stand­ing the Ro­bust­ness of Ad­ver­sar­ial Logit Pairing (Lo­gan Engstrom, An­drew Ilyas and Anish Atha­lye)

AI strat­egy and policy

The Facets of Ar­tifi­cial In­tel­li­gence: A Frame­work to Track the Evolu­tion of AI (Fer­nando Martinez-Plumed et al)

Pod­cast: Six Ex­perts Ex­plain the Killer Robots De­bate (Paul Scharre, Toby Walsh, Richard Moyes, Mary Ware­ham, Bon­nie Docherty, Peter Asaro, and Ariel Conn)

AI capabilities

Re­in­force­ment learning

Learn­ing Dex­ter­ity (Many peo­ple at OpenAI): Sum­ma­rized in the high­lights!

Vari­a­tional Op­tion Dis­cov­ery Al­gorithms (Joshua Achiam et al): Sum­ma­rized in the high­lights!

Learn­ing Plannable Rep­re­sen­ta­tions with Causal In­foGAN (Tha­nard Ku­ru­tach, Aviv Ta­mar et al): Hier­ar­chi­cal re­in­force­ment learn­ing aims to learn a hi­er­ar­chy of ac­tions that an agent can take, each im­ple­mented in terms of ac­tions lower in the hi­er­ar­chy, in or­der to get more effi­cient plan­ning. Another way we can achieve this is to use a clas­si­cal plan­ning al­gorithm to find a se­quence of way­points, or states that the agent should reach that will al­low it to reach its goal. Th­ese way­points can be thought of as a high-level plan. You can then use stan­dard RL al­gorithms to figure out how to go from one way­point to the next. How­ever, typ­i­cal plan­ning al­gorithms that can pro­duce a se­quence of way­points re­quire very struc­tured state rep­re­sen­ta­tions, that were de­signed by hu­mans in the past. How can we learn them di­rectly from data? This pa­per pro­poses Causal In­foGAN. They use a GAN where the gen­er­a­tor cre­ates ad­ja­cent way­points in the se­quence, while the dis­crim­i­na­tor tries to dis­t­in­guish be­tween way­points from the gen­er­a­tor and pairs of points sam­pled from the true en­vi­ron­ment. This in­cen­tivizes the gen­er­a­tor to gen­er­ate way­points that are close to each other, so that we can use an RL al­gorithm to learn to go from one way­point to the next. How­ever, this only lets us gen­er­ate ad­ja­cent way­points. In or­der to use this to make a se­quence of way­points that gets from a start state to a goal state, we need to use some clas­si­cal plan­ning al­gorithm. In or­der to do that, we need to have a struc­tured state rep­re­sen­ta­tion. GANs do not do this by de­fault. In­foGAN tries to make the la­tent rep­re­sen­ta­tion in a GAN more mean­ingful by pro­vid­ing the gen­er­a­tor with a “code” (a state in our case) and max­i­miz­ing the mu­tual in­for­ma­tion of the code and the out­put of the gen­er­a­tor. 
In this set­ting, we want to learn rep­re­sen­ta­tions that are good for plan­ning, so we want to en­code in­for­ma­tion about tran­si­tions be­tween states. This leads to the Causal In­foGAN ob­jec­tive, where we provide the gen­er­a­tor with a pair of ab­stract states (s, s’), have it gen­er­ate a pair of ob­ser­va­tions (o, o’) and max­i­mize the mu­tual in­for­ma­tion be­tween (s, s’) and (o, o’), so that s and s’ be­come good low-di­men­sional rep­re­sen­ta­tions of o and o’. They show that Causal In­foGAN can cre­ate se­quences of way­points in a rope ma­nipu­la­tion task, that pre­vi­ously had to be done man­u­ally.
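Once there is a structured (here, discrete) abstract state space, the high-level planning step can be sketched as a graph search. This is only an illustration under my own assumptions: the `transitions` dictionary is a hypothetical stand-in for the learned abstract transition model, and breadth-first search stands in for whichever classical planner is actually used.

```python
from collections import deque


def plan_waypoints(transitions, start, goal):
    """Breadth-first search over abstract states.

    `transitions` maps each abstract state to the states reachable in one step.
    Returns the list of waypoints from start to goal, or None if unreachable.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            # Walk back through the parent pointers to recover the path.
            path = []
            while state is not None:
                path.append(state)
                state = parents[state]
            return path[::-1]
        for nxt in transitions.get(state, ()):
            if nxt not in parents:
                parents[nxt] = state
                queue.append(nxt)
    return None
```

Each abstract waypoint would then be mapped back to an observation by the generator, and a low-level controller or RL policy would be responsible for getting from one waypoint to the next.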

My opinion: We’re seeing more and more work combining classical symbolic approaches with the current wave of statistical machine learning from big data, which gets the best of both worlds. While the results we see are not general intelligence, it’s becoming less and less true that you can point to a broad swath of capabilities that AI cannot do yet. I wouldn’t be surprised if a combination of symbolic and statistical AI techniques led to large capability gains in the next few years.

Deep learning

Ten­sorFuzz: De­bug­ging Neu­ral Net­works with Cover­age-Guided Fuzzing (Au­gus­tus Odena et al)


AI Strat­egy Pro­ject Man­ager (FHI)