Alignment Newsletter #23


Visual Reinforcement Learning with Imagined Goals (Vitchyr Pong and Ashvin Nair): This is a blog post explaining a paper by the same name that I covered in AN #16. It’s particularly clear and well-explained, and I continue to think the idea is cool and interesting. I’ve recopied my summary and opinion here, but you should read the blog post; it explains the idea very well.

Hindsight Experience Replay (HER) introduced the idea of accelerating learning with sparse rewards by taking trajectories where you fail to achieve the goal (and so get no reward, and thus no learning signal) and replacing the actual goal with an “imagined” goal chosen in hindsight so that the trajectory actually achieves it, which means you get reward and can learn. This requires a goal space rich enough that, for any trajectory, you can find a goal that the trajectory achieves. In practice, this means you are limited to tasks where the goals are of the form “reach this goal state”. However, if your goal state is an image, it is very hard to learn how to act in order to reach any possible image goal state (even if you restrict to realistic ones), since the space is so large and unstructured. The authors propose to first learn a structured latent representation of the space of images using a variational autoencoder (VAE), and then use that structured latent space as the space of goals which can be achieved. They also use Q-learning instead of DDPG (which is what HER used), so that they can imagine any goal with a minibatch (s, a, s’) and learn from it (whereas HER with DDPG is limited to goals drawn from states on the trajectory).

My opinion: This is a cool example of a relatively simple yet powerful idea: instead of having a goal space over all states, learn a good latent representation and use that as your goal space. This enables unsupervised learning in order to figure out how to use a robot to generally affect the world, probably similarly to how babies explore and learn.
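The relabeling idea, with goals living in a learned latent space, can be sketched as follows. Everything here is a toy: `encode` is a hypothetical stand-in for a trained VAE encoder, and the reward threshold and dimensions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained VAE encoder: any map from
# high-dimensional observations to a low-dimensional latent code.
def encode(obs):
    return obs[:2]  # pretend the first 2 dims are the learned latent code

def latent_reward(z, z_goal, eps=0.5):
    # Sparse reward: success iff the achieved latent is close to the goal latent.
    return 0.0 if np.linalg.norm(z - z_goal) < eps else -1.0

# A failed trajectory: the commanded goal is never reached.
trajectory = [rng.normal(size=8) for _ in range(5)]
commanded_goal = encode(rng.normal(size=8) + 100.0)  # deliberately far away

original = [latent_reward(encode(s), commanded_goal) for s in trajectory]

# Hindsight relabeling: pick the final achieved latent state as the
# "imagined" goal, so the very same trajectory now yields learning signal.
imagined_goal = encode(trajectory[-1])
relabeled = [latent_reward(encode(s), imagined_goal) for s in trajectory]

print(original)   # all -1.0: no learning signal under the commanded goal
print(relabeled)  # the last step succeeds under the imagined goal
```

With image observations, doing this comparison directly in pixel space would be hopeless; the point of the latent space is that distances there are meaningful enough for this sparse reward to work.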

Impact Measure Desiderata (TurnTrout): This post gives a long list of desiderata that we might want an impact measure to satisfy. It considers the case where the impact measure is a second level of safety that is supposed to protect us if we don’t succeed at value alignment. This means that we want our impact measure to be agnostic to human values. We’d also like it to be agnostic to goals, environments, and representations of the environment. There are several other desiderata; read the post for more details, since my summary would just be repeating it.

My opinion: These seem like generally good desiderata, though I don’t know how to formalize them to the point that we can actually check with reasonable certainty whether a proposed impact measure meets them.

I have one additional desideratum for impact measures: the impact measure alone should disallow all extinction scenarios, while still allowing the AI system to do most of the things we use AI for today. This is rather weak; really I’d want AI to do more tasks than are done today. However, even in this weak form, I doubt that we can satisfy this desideratum if we must also be agnostic to values, goals, representations, and environments. We could have valued human superiority at game-playing very highly, in which case building AlphaGo would have been catastrophic. How can an impact measure allow that without having at least some knowledge about values?

Recurrent World Models Facilitate Policy Evolution (David Ha et al): I read the interactive version of the paper. The basic idea is to do model-based reinforcement learning, where the model is composed of a variational autoencoder that turns a high-dimensional state of pixels into a low-dimensional representation, and a large RNN that predicts how the (low-dimensional) state will evolve in the future. The outputs of this model are fed into a very simple linear controller that chooses actions. Since the controller is so simple, they can train it using a black-box optimization method (an evolution strategy) that doesn’t require any gradient information. They evaluate on a racing task and on Doom, and set new state-of-the-art results. There are also other interesting setups; for example, once you have a world model, you can train the controller completely within the world model without interacting with the outside world at all (using the number of timesteps before the episode ends as your reward function, since the world model doesn’t predict standard rewards, but does predict whether the episode ends). There are a lot of cool visualizations that let you play with the models trained with their method.

My opinion: I agree with Shimon Whiteson’s take, which is that this method gets improvements by creating a separation of concerns between modelling the world and learning a controller for the model, and evaluating on environments where this separation mostly holds. A major challenge in RL is learning the features that are important for the task under consideration; this method instead learns features that allow you to reconstruct the state, which could be very different, but happen not to be different in their environments. That said, I really like the presentation of the paper and the fact that they did ablation studies.
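To make the separation of concerns concrete, here is a minimal sketch of just the controller-training step: a linear controller evolved entirely inside a world model, with no gradients. The "world model" here is a hand-coded linear system standing in for the paper’s VAE + RNN, and the dynamics, reward, and ES hyperparameters are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the learned world model: fixed linear dynamics in a
# 2-D latent space. (The paper learns this as a VAE encoder plus an RNN.)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([0.0, 0.1])

def rollout(w, steps=50):
    """Return of the linear controller a = w·z, rolled out inside the model."""
    z = np.array([1.0, 0.0])
    total = 0.0
    for _ in range(steps):
        a = float(w @ z)
        z = A @ z + B * a
        total -= float(z @ z)  # invented reward: stay near the origin
    return total

# Minimal evolution strategy: no gradient information, just perturb-and-select.
# The current weights stay in the population, so the best return never degrades.
w = np.zeros(2)
for _ in range(30):
    population = [w] + [w + 0.1 * rng.normal(size=2) for _ in range(16)]
    w = max(population, key=rollout)

print(rollout(w) >= rollout(np.zeros(2)))  # True
```

Because the controller is tiny (two parameters here, a few hundred in the paper), this kind of black-box search is feasible; the heavy lifting has already been done by the world model.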

Previous newsletters

Model Reconstruction from Model Explanations (Smitha Milli et al): Back in AN #16, I said that one way to prevent model reconstruction from gradient-based explanations was to add noise to the gradients. Smitha pointed out that the experiments with SmoothGrad are actually of this form, and it is still possible to recover the full model, so even adding noise may not help. I don’t really understand SmoothGrad and its relationship with noise (which is chosen to make a saliency map look nice, if I understand correctly), so I don’t know exactly what to think here.

Technical AI alignment

Agent foundations

When wishful thinking works (Alex Mennen): Sometimes beliefs can be loopy, in that the probability of a belief being true depends on whether you believe it. For example, the probability that a placebo helps you may depend on whether you believe that a placebo helps you. In the situation where you know this, you can “wish” your beliefs to be the most useful possible beliefs. In the case where the “true probability” depends continuously on your beliefs, you can use a fixed point theorem to find a consistent set of probabilities. There may be many such fixed points, in which case you can choose the one that would lead to the highest expected utility (such as choosing to believe in the placebo). One particular application of this would be to think of the propositions as “you will take action a_i”. In this case, you act the way you believe you act, and then every probability distribution over the propositions is a fixed point, and so we just choose the probability distribution (i.e. stochastic policy) that maximizes expected utility, as usual. This analysis can also be carried over to Nash equilibria, where beliefs about what actions you take will affect the actions that the other player takes.
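A minimal sketch of the fixed-point selection, assuming an invented continuous map from belief to true probability (the smoothstep function and the utility values are illustrative, not from the post):

```python
import numpy as np

# Hypothetical "loopy" belief: the true probability that a placebo helps
# depends continuously on how strongly you believe it helps. This smoothstep
# map is invented for illustration; its fixed points are 0, 0.5, and 1.
def true_prob(p):
    return 3 * p**2 - 2 * p**3

# Scan for consistent beliefs p = true_prob(p). Brouwer's fixed point
# theorem guarantees at least one for a continuous map [0, 1] -> [0, 1].
grid = np.linspace(0.0, 1.0, 100001)
candidates = grid[np.abs(true_prob(grid) - grid) < 1e-4]

# Collapse each cluster of nearby grid hits to a single fixed point.
fixed_points = []
for p in candidates:
    if not fixed_points or p - fixed_points[-1] > 1e-3:
        fixed_points.append(float(p))

# Among the consistent beliefs, adopt the one with highest expected utility
# (here: "wish" yourself into full belief that the placebo helps).
u_helps, u_fails = 10.0, 0.0
best = max(fixed_points, key=lambda p: p * u_helps + (1 - p) * u_fails)
print([round(p, 3) for p in fixed_points], round(best, 3))
```

The three fixed points are all self-consistent beliefs; only the expected-utility comparison breaks the tie, which is the sense in which wishful thinking "works" here.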

Counterfactuals and reflective oracles (Nisan)

Learning human intent

Cycle-of-Learning for Autonomous Systems from Human Interaction (Nicholas R. Waytowich et al): We’ve developed many techniques for learning behaviors from humans in the last few years. This paper categorizes them as learning from demonstrations (think imitation learning and IRL), learning from intervention (think Safe RL via Human Intervention), and learning from evaluation (think Deep RL from Human Preferences). They propose running these techniques in sequence, followed by pure RL, to train a full system. Intuitively, demonstrations are used to jumpstart the learning, getting to near-human performance; intervention- and evaluation-based learning then allow the system to safely improve beyond human level, since it can learn behaviors that humans can’t perform themselves but can recognize as good; and finally RL is used to improve even more.

My opinion: The general idea makes sense, but I wish they had actually implemented it and seen how it worked. (They do want to test it in robotics in future work.) For example, they talk about inferring a reward with IRL from demonstrations, and then updating it during the intervention and evaluation stages. How are they planning to update it? Does the format of the reward function have to be the same in all stages, and will that affect how well each method works?

This feels like a single point in the space of possible designs, and doesn’t include all of the techniques I’d be interested in. What about active methods, combined with exploration methods in RL? Perhaps you could start with a hand-specified reward function, get a prior using inverse reward design, start optimizing it using RL with curiosity, and have a human either intervene when necessary (if you want safe exploration) or have the RL system actively query the human at certain states, where the human can respond with demonstrations or evaluations.
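The proposed sequencing can be illustrated with a toy staged pipeline. Every stage below is an invented stub standing in for a real algorithm (imitation, intervention-based learning, preference learning, RL), and all the numbers are made up; this only conveys the jumpstart-then-surpass intuition, not the paper’s actual method.

```python
# Toy 1-D "policy": a single number, where TARGET is the unknown optimal
# behavior and HUMAN_LEVEL is the best a human demonstrator can do.
TARGET = 10.0
HUMAN_LEVEL = 7.0

def stage_demonstrations(policy):
    # Imitation: jump straight to (near) human-level performance.
    return HUMAN_LEVEL

def stage_intervention(policy):
    # A human blocks clearly bad exploration; take a step only if it helps.
    proposal = policy + 1.0
    return proposal if abs(proposal - TARGET) < abs(policy - TARGET) else policy

def stage_evaluation(policy):
    # Preference feedback: the human picks the better of two tweaks,
    # which can push performance past what the human could demonstrate.
    candidates = [policy - 0.5, policy + 0.5]
    return min(candidates, key=lambda p: abs(p - TARGET))

def stage_rl(policy):
    # Pure RL fine-tuning with a (by now learned) reward signal.
    for _ in range(100):
        policy += 0.1 * (TARGET - policy)
    return policy

policy = 0.0
for stage in [stage_demonstrations, stage_intervention, stage_evaluation, stage_rl]:
    policy = stage(policy)
print(round(policy, 3))  # essentially at TARGET
```

The open questions above (how the reward is represented and updated across stages) are exactly what this sketch papers over by sharing one scalar `policy` everywhere.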

Sample-Efficient Imitation Learning via Generative Adversarial Nets (Lionel Blondé et al)

A Roadmap for the Value-Loading Problem (Lê Nguyên Hoang)

Preventing bad behavior

Impact Measure Desiderata (TurnTrout): Summarized in the highlights!

Handling groups of agents

Reinforcement Learning under Threats (Víctor Gallego et al): Due to lack of time, I only skimmed this paper for 5 minutes, but my general sense is that it takes MDPs and turns them into two-player games by positing the presence of an adversary. It modifies the Bellman update equations to handle the adversary, but runs into the usual problems of simulating an adversary that simulates you. So, it formalizes level-k thinking (simulating an opponent that thinks about you at level k-1), and evaluates this on matrix games and the friend-or-foe environment from AI safety gridworlds.

My opinion: I’m not sure what this adds over two-player game theory (for which we can compute equilibria), but again I only skimmed the paper, so it’s quite likely that I missed something.
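Level-k thinking itself is easy to sketch on a matrix game: level-0 plays uniformly at random, and level-k best-responds to a level-(k-1) model of the opponent. Matching pennies is used here purely for illustration (it is not necessarily one of the paper’s test games).

```python
import numpy as np

# Payoffs for a 2x2 zero-sum game (rows = agent, columns = adversary):
# matching pennies, where the agent wants to match and the adversary to mismatch.
agent_payoff = np.array([[1.0, -1.0], [-1.0, 1.0]])
adversary_payoff = -agent_payoff

def best_response(payoff, opponent_dist):
    """Pure best response to a mixed strategy of the opponent."""
    expected = payoff @ opponent_dist
    return np.eye(len(expected))[np.argmax(expected)]

def level_k_action(k, my_payoff, their_payoff):
    """Level-0 is uniform; level-k best-responds to a level-(k-1) opponent."""
    if k == 0:
        return np.full(2, 0.5)
    their_play = level_k_action(k - 1, their_payoff, my_payoff)
    return best_response(my_payoff, their_play)

for k in range(4):
    print(k, level_k_action(k, agent_payoff, adversary_payoff))
```

In matching pennies the pure strategies cycle as k grows, which illustrates why bounding the recursion at some level k (rather than simulating an opponent that simulates you forever) is the move the paper needs.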

Near-term concerns

Adversarial examples

Adversarial Reprogramming of Sequence Classification Neural Networks (Paarth Neekhara et al)

Fairness and bias

Introducing the Inclusive Images Competition (Tulsee Doshi): The authors write, “this competition challenges you to use Open Images, a large, multilabel, publicly-available image classification dataset that is majority-sampled from North America and Europe, to train a model that will be evaluated on images collected from a different set of geographic regions across the globe”. The results will be presented at NIPS 2018 in December.

My opinion: I’m really interested in the techniques and results here, since there’s a clear, sharp distribution shift from the training set to the test set, which is always hard to deal with. Hopefully some of the entries will have general solutions that we can adapt to other settings.

AI strategy and policy

Podcast: Artificial Intelligence – Global Governance, National Policy, and Public Trust with Allan Dafoe and Jessica Cussins (Allan Dafoe, Jessica Cussins, and Ariel Conn): Topics discussed include the difference between AI governance and AI policy, externalities and solving them through regulation, whether governments and bureaucracies can keep up with AI research, the extent to which the US’ policy of not regulating AI may cause citizens to lose trust, labor displacement and inequality, and AI races.

Other progress in AI

Reinforcement learning

Visual Reinforcement Learning with Imagined Goals (Vitchyr Pong and Ashvin Nair): Summarized in the highlights!

Recurrent World Models Facilitate Policy Evolution (David Ha et al): Summarized in the highlights!

ARCHER: Aggressive Rewards to Counter bias in Hindsight Experience Replay (Sameera Lanka et al)

SOLAR: Deep Structured Latent Representations for Model-Based Reinforcement Learning (Marvin Zhang, Sharad Vikram et al)

ExpIt-OOS: Towards Learning from Planning in Imperfect Information Games (Andy Kitchen et al)

Miscellaneous (AI)

Making it easier to discover datasets (Natasha Noy): Google has launched Dataset Search, a tool that lets you search for datasets that you could then use in research.

My opinion: I imagine that this is primarily targeted at data scientists aiming to learn about the real world, and not ML researchers, but I wouldn’t be surprised if it was helpful for us as well. MNIST and ImageNet are both present, and a search for “self-driving cars” turned up some promising-looking links that I didn’t investigate further.
