Alignment Newsletter #23

Highlights

Visual Reinforcement Learning with Imagined Goals (Vitchyr Pong and Ashvin Nair): This is a blog post explaining a paper by the same name that I covered in AN #16. It's particularly clear and well-explained, and I continue to think the idea is cool and interesting. I've copied my summary and opinion here, but you should still read the blog post; it explains the idea very well.

Hindsight Experience Replay (HER) introduced the idea of accelerating learning with sparse rewards by taking trajectories where you fail to achieve the goal (and so get no reward, and thus no learning signal) and replacing the actual goal with an “imagined” goal chosen in hindsight so that the trajectory does achieve it, which means you get reward and can learn. This requires a goal space rich enough that, for any trajectory, you can find some goal that the trajectory achieves. In practice, this means that you are limited to tasks where the goals are of the form “reach this goal state”. However, if your goal state is an image, it is very hard to learn how to act in order to reach any possible image goal state (even if you restrict to realistic ones), since the space is so large and unstructured. The authors propose to first learn a structured latent representation of the space of images using a variational autoencoder (VAE), and then use that structured latent space as the space of goals which can be achieved. They also use Q-learning instead of DDPG (which is what HER used), so that they can pair any imagined goal with a minibatch of transitions (s, a, s') and learn from it (whereas HER/DDPG is limited to goals taken from states on the trajectory).
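To make the relabeling step concrete, here is a minimal sketch (not the authors' code) of hindsight relabeling with latent goals; the `encode` function, the replay format, and the negative-latent-distance reward are assumptions for illustration.

```python
import numpy as np

def relabel_with_latent_goals(batch, encode, sample_goal, relabel_frac=0.5):
    """Replace a fraction of goals in a minibatch with "imagined" latent goals.

    batch: iterable of (obs_image, action, next_obs_image, goal_latent)
    encode: maps an image to its VAE latent (assumed to be given)
    sample_goal: samples a latent goal, e.g. from the VAE prior N(0, I)
    """
    relabeled = []
    for obs, action, next_obs, goal in batch:
        z, z_next = encode(obs), encode(next_obs)
        if np.random.rand() < relabel_frac:
            # Since Q-learning is off-policy, any imagined goal can be paired
            # with this transition, not just states on the trajectory.
            goal = sample_goal()
        # Reward: how close the reached latent state is to the goal.
        reward = -np.linalg.norm(z_next - goal)
        relabeled.append((z, action, z_next, goal, reward))
    return relabeled

def sample_prior_goal(latent_dim=16):
    """Example goal sampler: the VAE prior, a standard Gaussian."""
    return np.random.randn(latent_dim)
```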

My opinion: This is a cool example of a relatively simple yet powerful idea: instead of having a goal space over all states, learn a good latent representation and use that as your goal space. This enables unsupervised learning in order to figure out how to use a robot to generally affect the world, probably similarly to how babies explore and learn.

Impact Measure Desiderata (TurnTrout): This post gives a long list of desiderata that we might want an impact measure to satisfy. It considers the case where the impact measure is a second layer of safety that is supposed to protect us if we don't succeed at value alignment. This means that we want our impact measure to be agnostic to human values. We'd also like it to be agnostic to goals, environments, and representations of the environment. There are several other desiderata; read the post for more details, since my summary would just be repeating it.

My opinion: These seem like generally good desiderata, though I don't know how to formalize them to the point that we can actually check with reasonable certainty whether a proposed impact measure meets these desiderata.

I have one additional desideratum for impact measures. The impact measure alone should disallow all extinction scenarios, while still allowing the AI system to do most of the things we use AI for today. This is rather weak; really, I'd want AI to do more tasks than it does today. However, even in this weak form, I doubt that we can satisfy this desideratum if we must also be agnostic to values, goals, representations and environments. We could have valued human superiority at game-playing very highly, in which case building AlphaGo would be catastrophic. How can an impact measure allow that without having at least some knowledge about values?

Recurrent World Models Facilitate Policy Evolution (David Ha et al): I read the interactive version of the paper. The basic idea is to do model-based reinforcement learning, where the model is composed of a variational auto-encoder that turns a high-dimensional state of pixels into a low-dimensional representation, and a large RNN that predicts how the (low-dimensional) state will evolve in the future. The outputs of this model are fed into a very simple linear controller that chooses actions. Since the controller is so simple, they can train it using a black-box optimization method (an evolutionary strategy) that doesn't require any gradient information. They evaluate on a racing task and on Doom, and set new state-of-the-art results. There are also other interesting setups; for example, once you have a world model, you can train the controller completely within the world model without interacting with the outside world at all (using the number of timesteps before the episode ends as the reward function, since the world model doesn't predict standard rewards, but does predict whether the episode ends). There are a lot of cool visualizations that let you play with the models trained with their method.
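To give a sense of just how simple the controller is, here is a minimal sketch (dimensions and names are my assumptions, not the paper's code) of a linear policy acting on the VAE latent z and the RNN hidden state h, with parameters meant to be optimized by a gradient-free evolution strategy such as CMA-ES.

```python
import numpy as np

LATENT_DIM = 32    # assumed size of the VAE latent z
HIDDEN_DIM = 256   # assumed size of the RNN hidden state h
ACTION_DIM = 3     # e.g. steering, gas, brake for a racing task

N_PARAMS = ACTION_DIM * (LATENT_DIM + HIDDEN_DIM) + ACTION_DIM

def controller(params, z, h):
    """Linear controller: one matrix-vector product plus a bias, squashed by tanh."""
    W = params[:-ACTION_DIM].reshape(ACTION_DIM, LATENT_DIM + HIDDEN_DIM)
    b = params[-ACTION_DIM:]
    return np.tanh(W @ np.concatenate([z, h]) + b)

def evaluate(params, rollout_fn, episodes=4):
    """Average return of the controller; rollout_fn runs VAE + RNN + controller."""
    return np.mean([rollout_fn(lambda z, h: controller(params, z, h))
                    for _ in range(episodes)])

# Because there are only a few thousand parameters, `evaluate` can be handed
# directly to a black-box optimizer (e.g. a CMA-ES library) with no gradients.
```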

My opinion: I agree with Shimon Whiteson's take, which is that this method gets improvements by creating a separation of concerns between modelling the world and learning a controller for the model, and by evaluating on environments where this separation mostly holds. A major challenge in RL is learning the features that are important for the task under consideration; this method instead learns features that allow you to reconstruct the state, which could be very different, but happen not to be different in their environments. That said, I really like the presentation of the paper and the fact that they did ablation studies.

Previous newsletters

Model Reconstruction from Model Explanations (Smitha Milli et al): Back in AN #16, I said that one way to prevent model reconstruction from gradient-based explanations was to add noise to the gradients. Smitha pointed out that the experiments with SmoothGrad are actually of this form, and it is still possible to recover the full model, so even adding noise may not help. I don't really understand SmoothGrad and its relationship with noise (which is chosen to make a saliency map look nice, if I understand correctly), so I don't know exactly what to think here.
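For context, SmoothGrad computes a saliency map by averaging gradients taken at noisy copies of the input, with the noise level chosen to make the map look cleaner; a minimal sketch (the `grad_fn` argument is an assumption for illustration) is:

```python
import numpy as np

def smoothgrad(grad_fn, x, n_samples=50, noise_std=0.15):
    """Average input gradients over Gaussian-perturbed copies of x.

    grad_fn: maps an input to the gradient of the class score with respect
    to that input (assumed to be provided by the model being explained).
    """
    grads = [grad_fn(x + np.random.normal(0.0, noise_std, size=x.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)
```

The observation above is that even these noise-averaged gradients can leak enough information to reconstruct the model.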

Technical AI alignment

Agent foundations

When wishful thinking works (Alex Mennen): Sometimes beliefs can be loopy, in that the probability of a belief being true depends on whether you believe it. For example, the probability that a placebo helps you may depend on whether you believe that a placebo helps you. In a situation where you know this, you can “wish” your beliefs to be the most useful possible beliefs. In the case where the “true probability” depends continuously on your beliefs, you can use a fixed point theorem to find a consistent set of probabilities. There may be many such fixed points, in which case you can choose the one that would lead to the highest expected utility (such as choosing to believe in the placebo). One particular application of this would be to think of the propositions as “you will take action a_i”. In this case, you act the way you believe you act, so every probability distribution over the propositions is a fixed point, and we just choose the probability distribution (i.e. stochastic policy) that maximizes expected utility, as usual. This analysis can also be carried over to Nash equilibria, where beliefs about what actions you take will affect the actions that the other player takes.
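As a concrete version of the placebo example (notation mine, not from the post): let b be your credence that the placebo will help, and suppose the true probability that it helps is a continuous function p(b) of that credence.

```latex
% A belief b \in [0,1] is consistent iff it is a fixed point of p:
b^\ast = p(b^\ast),
% which exists by Brouwer's fixed point theorem, since p : [0,1] \to [0,1]
% is continuous. Among the fixed points, "wishful thinking" picks the one
% with the highest expected utility:
b^\ast \in \operatorname*{arg\,max}_{\,b \,:\, b = p(b)} \; \mathbb{E}\left[U \mid b\right].
```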

Counterfactuals and reflective oracles (Nisan)

Learning human intent

Cycle-of-Learning for Autonomous Systems from Human Interaction (Nicholas R. Waytowich et al): We've developed many techniques for learning behaviors from humans in the last few years. This paper categorizes them as learning from demonstrations (think imitation learning and IRL), learning from intervention (think Safe RL via Human Intervention), and learning from evaluation (think Deep RL from Human Preferences). They propose running these techniques in sequence, followed by pure RL, to train a full system. Intuitively, demonstrations are used to jumpstart the learning, getting to near-human performance; then intervention- and evaluation-based learning allow the system to safely improve beyond human level, since it can learn behaviors that humans can't perform themselves but can recognize as good; and then RL is used to improve even more.
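The sequencing itself is simple; as a rough sketch (my framing, not the paper's code), each stage is just a function that takes the current policy and reward model and returns improved ones:

```python
def cycle_of_learning(stages, policy=None, reward_model=None):
    """Chain human-in-the-loop learning stages, then pure RL.

    stages: an ordered list of callables, each mapping (policy, reward_model)
    to an updated (policy, reward_model). In the paper's proposal these would
    be learning from demonstrations, from interventions, from evaluations,
    and finally standard RL.
    """
    for stage in stages:
        policy, reward_model = stage(policy, reward_model)
    return policy
```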

My opinion: The general idea makes sense, but I wish they had actually implemented it and seen how it worked. (They do want to test it in robotics in future work.) For example, they talk about inferring a reward with IRL from demonstrations, and then updating it during the intervention and evaluation stages. How are they planning to update it? Does the format of the reward function have to be the same in all stages, and will that affect how well each method works?

This feels like a single point in the space of possible designs, and doesn't include all of the techniques I'd be interested in. What about active methods, combined with exploration methods in RL? Perhaps you could start with a hand-specified reward function, get a prior using inverse reward design, start optimizing it using RL with curiosity, and have a human either intervene when necessary (if you want safe exploration) or have the RL system actively query the human at certain states, where the human can respond with demonstrations or evaluations.

Sample-Efficient Imitation Learning via Generative Adversarial Nets (Lionel Blondé et al)

A Roadmap for the Value-Loading Problem (Lê Nguyên Hoang)

Preventing bad behavior

Impact Measure Desiderata (TurnTrout): Summarized in the highlights!

Handling groups of agents

Reinforcement Learning under Threats (Víctor Gallego et al): Due to lack of time, I only skimmed this paper for 5 minutes, but my general sense is that it takes MDPs and turns them into two-player games by positing the presence of an adversary. It modifies the Bellman update equations to handle the adversary, but runs into the usual problems of simulating an adversary that simulates you. So, it formalizes level-k thinking (simulating an opponent that thinks about you at level k-1), and evaluates this on matrix games and the friend-or-foe environment from AI safety gridworlds.
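As a rough illustration of level-k reasoning in a one-shot matrix game (a sketch of the general idea, not the paper's algorithm): a level-0 agent plays some fixed baseline action, and a level-k agent best-responds to an opponent modelled as reasoning at level k-1.

```python
import numpy as np

def best_response(payoff, opponent_action):
    """Index of the action that maximizes payoff against the opponent's action."""
    return int(np.argmax(payoff[:, opponent_action]))

def level_k_action(payoff_self, payoff_opp, k, level0_action=0):
    """Level-k play: best-respond to an opponent assumed to reason at level k-1.

    Both payoff matrices are indexed as payoff[own_action, other_action].
    """
    if k == 0:
        return level0_action  # arbitrary baseline behavior for level-0 play
    opponent_action = level_k_action(payoff_opp, payoff_self, k - 1, level0_action)
    return best_response(payoff_self, opponent_action)

# Example: a symmetric 2x2 game with made-up payoffs (0 = cooperate, 1 = defect).
payoff = np.array([[3, 0],
                   [5, 1]])
print(level_k_action(payoff, payoff, k=2))  # prints 1: defect against a level-1 opponent
```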

My opinion: I'm not sure what this is adding over two-player game theory (for which we can compute equilibria), but again, I only skimmed the paper, so it's quite likely that I missed something.

Near-term concerns

Adversarial examples

Adversarial Reprogramming of Sequence Classification Neural Networks (Paarth Neekhara et al)

Fairness and bias

Introducing the Inclusive Images Competition (Tulsee Doshi): The authors write, “this competition challenges you to use Open Images, a large, multilabel, publicly-available image classification dataset that is majority-sampled from North America and Europe, to train a model that will be evaluated on images collected from a different set of geographic regions across the globe”. The results will be presented at NIPS 2018 in December.

My opinion: I'm really interested in the techniques and results here, since there's a clear, sharp distribution shift from the training set to the test set, which is always hard to deal with. Hopefully some of the entries will have general solutions which we can adapt to other settings.

AI strategy and policy

Podcast: Artificial Intelligence – Global Governance, National Policy, and Public Trust with Allan Dafoe and Jessica Cussins (Allan Dafoe, Jessica Cussins, and Ariel Conn): Topics discussed include the difference between AI governance and AI policy, externalities and solving them through regulation, whether governments and bureaucracies can keep up with AI research, the extent to which the US' policy of not regulating AI may cause citizens to lose trust, labor displacement and inequality, and AI races.

Other progress in AI

Reinforcement learning

Visual Reinforcement Learning with Imagined Goals (Vitchyr Pong and Ashvin Nair): Summarized in the highlights!

Recurrent World Models Facilitate Policy Evolution (David Ha et al): Summarized in the highlights!

ARCHER: Aggressive Rewards to Counter bias in Hindsight Experience Replay (Sameera Lanka et al)

SOLAR: Deep Structured Latent Representations for Model-Based Reinforcement Learning (Marvin Zhang, Sharad Vikram et al)

ExIt-OOS: Towards Learning from Planning in Imperfect Information Games (Andy Kitchen et al)

Miscellaneous (AI)

Making it easier to discover datasets (Natasha Noy): Google has launched Dataset Search, a tool that lets you search for datasets that you could then use in research.

My opinion: I imagine that this is primarily targeted at data scientists aiming to learn about the real world, and not ML researchers, but I wouldn't be surprised if it was helpful for us as well. MNIST and ImageNet are both present, and a search for “self-driving cars” turned up some promising-looking links that I didn't investigate further.