Alignment Newsletter #35

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

This week we don’t have any explicit highlights, but remember to treat the sequences as though they were highlighted!

Technical AI alignment

Iterated amplification sequence

Corrigibility (Paul Christiano): A corrigible agent is one which helps its operator, even with tasks that would change the agent itself, such as correcting mistakes in AI design. Consider a good act-based agent, which chooses actions according to our preferences over that action. Since we have a short-term preference for corrigibility, the act-based agent should be corrigible. For example, if we are trying to turn off the agent, the agent will turn off because that’s what we would prefer: it is easy to infer that the overseer would not prefer that agents stop the overseer from shutting them down. Typically we only believe that an agent would stop us from shutting it down if it makes long-term plans, in which case being operational is instrumentally useful, but an act-based agent only optimizes for its overseer’s short-term preferences. One potential objection is that the notion of corrigibility is not easy to learn, but it seems not that hard to answer the question “Is the operator being misled?”, and in any case we can try this with simple systems, and the results should improve with more capable systems, since as you get smarter you are more capable of predicting the overseer.

In addition, even if an agent has a slightly wrong notion of the overseer’s values, it seems like it will improve over time. It is not hard to infer that the overseer wants the agent to make its approximation of the overseer’s values more accurate. So, as long as the agent has enough of the overseer’s preferences to be corrigible, it will try to learn about the preferences it is wrong about and will become more and more aligned over time. In addition, any slight value drifts caused by e.g. amplification will tend to be fixed over time, at least on average.

Rohin’s opinion: I really like this formulation of corrigibility, which I find quite different from MIRI’s paper. This seems a lot more in line with the kind of reasoning that I want from an AI system, and it seems like iterated amplification or something like it could plausibly succeed at achieving this sort of corrigible behavior.

Iterated Distillation and Amplification (Ajeya Cotra): This is the first in a series of four posts describing the iterated amplification framework in different ways. This post focuses on the repetition of two steps. In amplification, we take a fast aligned agent and turn it into a slow but more capable aligned agent, by allowing a human to coordinate many copies of the fast agent in order to make better decisions. In distillation, we take a slow aligned agent and turn it into a fast aligned agent (perhaps by training a neural net to imitate the judgments of the slow agent). This is similar to AlphaGo Zero, in which MCTS can be thought of as amplification, while distillation consists of updating the neural net to predict the outputs of the MCTS.

This allows us to get both alignment and powerful capabilities, whereas usually the two trade off against each other. High capability implies a sufficiently broad mandate to search for good behaviors, allowing our AI systems to find novel behaviors that we never would have thought of, which could be bad if the objective was slightly wrong. On the other hand, high alignment typically requires staying within the realm of human behavior, as in imitation learning, which prevents the AI from finding novel solutions.

In addition to distillation and amplification robustly preserving alignment, we also need to ensure that given a human as a starting point, iterated distillation and amplification can scale to arbitrary capabilities. We would also want it to be about as cost-efficient as alternatives. This seems to be true at test time, when we are simply executing a learned model, but it could be that training is much more expensive.
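The alternation of amplification and distillation described above can be sketched on a toy task. This is only an illustration under strong assumptions, not the scheme from the post: the "fast agent" is a lookup table standing in for a neural net, the "overseer" is a fixed decomposition rule rather than a human, and the task (computing 0 + 1 + ... + n) and all function names are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional

# An agent answers a question n, or returns None if the question is beyond it.
Agent = Callable[[int], Optional[int]]


def make_fast_agent(table: Dict[int, int]) -> Agent:
    """Fast agent: a lookup table (stand-in for a trained network)."""
    return lambda n: table.get(n)


def amplify(fast: Agent) -> Agent:
    """Slow but more capable agent: decompose the question and delegate
    the subquestion to a copy of the fast agent (toy overseer rule)."""
    def slow(n: int) -> Optional[int]:
        direct = fast(n)
        if direct is not None:
            return direct
        if n <= 0:
            return None
        sub = fast(n - 1)  # subquestion: what is 0 + 1 + ... + (n - 1)?
        return None if sub is None else sub + n
    return slow


def distill(slow: Agent, questions: Iterable[int]) -> Agent:
    """Fast agent that imitates the slow agent on the training questions."""
    table = {}
    for q in questions:
        answer = slow(q)
        if answer is not None:
            table[q] = answer
    return make_fast_agent(table)


def iterate(rounds: int, max_question: int = 50) -> Agent:
    """Repeat amplify-then-distill, starting from an agent that only
    knows the trivial case."""
    agent = make_fast_agent({0: 0})
    for _ in range(rounds):
        agent = distill(amplify(agent), range(max_question + 1))
    return agent
```

In this toy, each round extends what the fast agent can answer by one step, so `iterate(5)` answers questions up to n = 5 and returns `None` beyond that. The sketch says nothing about whether alignment is preserved, which is the hard part discussed in the post.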

Rohin’s opinion: This is a great simple explanation of the scheme. I don’t have much to say about the idea since I’ve talked about iterated amplification so much in this newsletter already.

Benign model-free RL (Paul Christiano): This post is very similar to the previous one, just with different language: distillation is now implemented through reward modeling with robustness. The point of robustness is to ensure that the distilled agent is benign even outside of the training distribution (though it can be incompetent). There’s also an analysis of the costs of the scheme. One important note is that this approach only works for model-free RL systems; we’ll need something else for e.g. model-based RL, if it enables capabilities that we can’t get with model-free RL.

Value learning sequence

Intuitions about goal-directed behavior and Coherence arguments do not imply goal-directed behavior (Rohin Shah) (summarized by Richard): Rohin discusses the “misspecified goal argument for AI risk”: that even a small misspecification in goals can lead to adversarial behaviour in advanced AI. He argues that whether behaviour is goal-directed depends on whether it generalises to new situations in ways that are predictable given that goal. He also raises the possibility that thinking of an agent as goal-directed becomes less useful the more we understand about how it works. If true, this would weaken the misspecified goal argument.

In the next post, Rohin argues against the claim that “simply knowing that an agent is intelligent lets us infer that it is goal-directed”. He points out that all behaviour can be rationalized as expected utility maximisation over world-histories, but this may not meet our criteria for goal-directed behaviour, and slightly misspecifying such a utility function may well be perfectly safe. What’s more interesting (and dangerous) is expected utility maximisation over world-states, but he claims that we shouldn’t assume that advanced AI will have this sort of utility function unless we have additional information (e.g. that it has a utility function simple enough to be explicitly represented). There are plenty of intelligent agents which aren’t goal-directed, e.g. ones which are very good at inference but only take trivial actions.
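The claim that any behaviour can be rationalized as utility maximisation over world-histories can be made concrete with a small sketch. It assumes a deterministic toy setting in which a history is just the sequence of actions taken; the `twitch` policy and all names are hypothetical. The construction assigns utility 1 to histories the policy would produce and 0 to everything else.

```python
from itertools import product
from typing import Callable, List, Tuple

History = Tuple[str, ...]
Policy = Callable[[History], str]

ACTIONS = ("left", "right", "stay")


def rollout(policy: Policy, horizon: int) -> History:
    """The history the policy actually produces."""
    history: History = ()
    for _ in range(horizon):
        history = history + (policy(history),)
    return history


def rationalizing_utility(policy: Policy) -> Callable[[History], float]:
    """Utility over histories: 1 if every action matches what the policy
    would have done at that point, 0 otherwise."""
    def utility(history: History) -> float:
        consistent = all(
            policy(history[:t]) == action for t, action in enumerate(history)
        )
        return 1.0 if consistent else 0.0
    return utility


def best_histories(utility: Callable[[History], float], horizon: int) -> List[History]:
    """Brute-force search for the utility-maximising histories."""
    candidates = list(product(ACTIONS, repeat=horizon))
    best = max(utility(h) for h in candidates)
    return [h for h in candidates if utility(h) == best]


def twitch(prefix: History) -> str:
    """An arbitrary, not obviously goal-directed policy: alternate directions."""
    return "left" if len(prefix) % 2 == 0 else "right"
```

Here the only utility-maximising history is the one `twitch` produces, so the twitching agent counts as a perfect expected utility maximiser for this utility function, even though nothing about it looks goal-directed.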

Richard’s opinion: I broadly agree with Rohin’s points in these posts, and am glad that he’s making these arguments explicit. However, while goal-directedness is a tricky property to reason about, I think it’s still useful to consider it a property of an agent rather than a property of our model of that agent. It’s true that when we have a detailed explanation of how an agent works, we’re able to think of cases in which its goal-directedness breaks down (e.g. adversarial examples). However, when these examples are very rare, they don’t make much practical difference (e.g. knowing that AlphaGo has a blind spot in certain endgames might not be very helpful in beating it, because you can’t get to those endgames).

Agent foundations

Robust program equilibrium (Caspar Oesterheld)

Bounded Oracle Induction (Diffractor)

Oracle Induction Proofs (Diffractor)

Learning human intent

Guiding Policies with Language via Meta-Learning (John D. Co-Reyes) (summarized by Richard): The authors train an agent to perform tasks specified in natural language, with a “correction” after each attempt (also in natural language). They formulate this as a meta-learning problem: for each instruction, several attempt-correction cycles are allowed. Each attempt takes into account previous attempts to achieve the same instruction by passing each previous trajectory and its corresponding correction through a CNN, then using the mean of all outputs as an input to a policy module.

In their experiments, all instructions and corrections are generated automatically, and test-time performance is evaluated as a function of how many corrections are allowed. In one experiment, the task is to navigate rooms to reach a goal, where the correction is the next subgoal required. Given 4 corrections, their agent outperforms a baseline which was given all 5 subgoals at the beginning of the task. In another experiment, the task is to move a block to an ambiguously-specified location, and the corrections narrow down the target area; their trained agent scores 0.9, as opposed to 0.96 for an agent given the exact target location.
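The aggregation step in the summary (encode each previous trajectory with its correction, then average the outputs) can be sketched as follows. This is a minimal illustration, not the paper's architecture: `encode_pair` is a hypothetical hand-written stand-in for the CNN, trajectories are assumed to be plain lists of floats, and the embedding size is arbitrary.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
EMBED_DIM = 4  # arbitrary size chosen for this sketch


def encode_pair(trajectory: Sequence[float], correction: str) -> Vector:
    """Hypothetical stand-in for the learned encoder: maps one
    (trajectory, correction) pair to a fixed-length vector."""
    traj_mean = sum(trajectory) / len(trajectory) if trajectory else 0.0
    return [
        traj_mean,
        float(len(trajectory)),
        float(len(correction.split())),
        1.0,
    ]


def aggregate(pairs: Sequence[Tuple[Sequence[float], str]]) -> Vector:
    """Mean of the per-pair embeddings. With no previous attempts,
    fall back to a zero vector so the policy input has a fixed shape."""
    if not pairs:
        return [0.0] * EMBED_DIM
    embeddings = [encode_pair(t, c) for t, c in pairs]
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]
```

Averaging gives the policy a fixed-size input regardless of how many corrections have been given so far, and makes the result independent of the order of the pairs.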

Richard’s opinion: This paper explores an important idea: correcting poorly-specified instructions using human-in-the-loop feedback. The second task in particular is a nice toy example of iterative preference clarification. I’m not sure whether their meta-learning approach is directly relevant to safety, particularly because each correction is only “in scope” for a single episode, and also only occurs after a bad attempt has finished. However, the broad idea of correction-based learning seems promising.


Deeper Interpretability of Deep Networks (Tian Xu et al)

GAN Dissection: Visualizing and Understanding Generative Adversarial Networks (David Bau et al)

Please Stop Explaining Black Box Models for High Stakes Decisions (Cynthia Rudin)

Representer Point Selection for Explaining Deep Neural Networks (Chih-Kuan Yeh, Joon Sik Kim et al)

Adversarial examples

Robustness via curvature regularization, and vice versa (Moosavi-Dezfooli et al) (summarized by Dan H): This paper proposes a distinct way to increase adversarial perturbation robustness. They take an adversarial example generated with FGSM, compute the gradient of the loss at both the clean example and the adversarial example, and penalize the difference between the two gradients. Decreasing this penalty relates to decreasing the curvature of the loss surface. The technique works slightly worse than adversarial training.
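The penalty described in the summary can be sketched in a few lines. This follows the summary's wording rather than the paper's exact formulation: it assumes gradients are taken with respect to the input, uses a squared Euclidean norm for the difference, and takes the gradient as a caller-supplied function (`grad`, a hypothetical name) instead of computing it from a network.

```python
from typing import Callable, List

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def sign(value: float) -> int:
    """Returns -1, 0, or 1."""
    return (value > 0) - (value < 0)


def fgsm(x: Vector, grad: GradFn, eps: float) -> Vector:
    """Fast gradient sign method: step of size eps along the sign of the
    loss gradient at x."""
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad(x))]


def gradient_difference_penalty(x: Vector, grad: GradFn, eps: float) -> float:
    """Squared norm of (gradient at the FGSM example - gradient at x)."""
    x_adv = fgsm(x, grad, eps)
    return sum((ga - gc) ** 2 for ga, gc in zip(grad(x_adv), grad(x)))


def linear_grad(x: Vector) -> Vector:
    """Gradient of a linear loss: constant, so the loss has no curvature."""
    return [2.0, -1.0]


def quadratic_grad(x: Vector) -> Vector:
    """Gradient of the loss 1.5 * ||x||^2, which has curvature 3."""
    return [3.0 * xi for xi in x]
```

For the linear loss the penalty is exactly zero, while for the quadratic loss it grows with the square of the curvature, which is the sense in which shrinking the penalty flattens the loss surface. In training, this term would be added to the usual loss with some weight.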


Trainable Calibration Measures For Neural Networks From Kernel Mean Embeddings (Aviral Kumar et al)


How rapidly are GPUs improving in price performance? (gallabytes)

Time for AI to cross the human performance range in diabetic retinopathy (Aysja Johnson)

Near-term concerns

Fairness and bias

50 Years of Test (Un)fairness: Lessons for Machine Learning (Ben Hutchinson)

AI strategy and policy

Robust Artificial Intelligence and Robust Human Organizations (Thomas G. Dietterich)

Handful of Countries – Including the US and Russia – Hamper Discussions to Ban Killer Robots at UN

Other progress in AI


Montezuma’s Revenge Solved by Go-Explore, a New Algorithm for Hard-Exploration Problems (Adrien Ecoffet et al) (summarized by Richard): This blog post showcases an agent which achieves high scores in Montezuma’s Revenge and Pitfall by keeping track of a frontier of visited states (and the trajectories which led to them). In each training episode, a state is chosen from the frontier, the environment is reset to that state, and then the agent randomly explores further and updates the frontier. The authors argue that this addresses the tendency of intrinsic motivation algorithms to forget about promising areas they’ve already explored. To make state storage tractable, each state is stored as a downsampled 11x8 image.

The authors note that this solution exploits the determinism of the environment, which makes it brittle. So they then use imitation learning to learn a policy from demonstrations by the original agent. The resulting agents score many times higher than the state of the art on Montezuma’s Revenge and Pitfall.
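The exploration phase described above (pick a stored state, reset to it, explore randomly, update the frontier) can be sketched on a toy problem. This is a simplified illustration, not the authors' implementation: the environment is a hypothetical deterministic walk on the integers 0 to `GOAL`, states are chosen uniformly at random rather than with the authors' selection heuristics, and the later imitation learning step is omitted.

```python
import random
from typing import Dict, List, Sequence

GOAL = 30
ACTIONS = (-1, 1)


def step(state: int, action: int) -> int:
    """Deterministic toy environment: move along 0..GOAL, clipped at the ends."""
    return min(GOAL, max(0, state + action))


def replay(actions: Sequence[int]) -> int:
    """Run a stored action sequence from the initial state 0."""
    state = 0
    for action in actions:
        state = step(state, action)
    return state


def go_explore(iterations: int, explore_steps: int, seed: int = 0) -> Dict[int, List[int]]:
    """Returns an archive mapping each visited state to the shortest known
    action sequence that reaches it from the initial state."""
    rng = random.Random(seed)
    archive: Dict[int, List[int]] = {0: []}
    for _ in range(iterations):
        # Choose a previously visited state and "reset" the environment to it.
        # Determinism is what makes the stored trajectory a valid way back.
        start = rng.choice(sorted(archive))
        state, trajectory = start, list(archive[start])
        # Explore randomly from there, recording new or shorter routes.
        for _ in range(explore_steps):
            action = rng.choice(ACTIONS)
            state = step(state, action)
            trajectory.append(action)
            if state not in archive or len(trajectory) < len(archive[state]):
                archive[state] = list(trajectory)
    return archive
```

Because every archive entry stores a full action sequence, the archive never forgets how to return to a state it has already reached, which is the property the authors contrast with intrinsic motivation methods. The same reliance on replaying fixed action sequences is what makes the approach brittle without determinism.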

Richard’s opinion: I’m not particularly impressed by this result, for a couple of reasons. Firstly, I think that exploiting determinism by resetting the environment (or even just memorising trajectories) fundamentally changes the nature of the problem posed by hard Atari games. Doing so allows us to solve them in the same ways as any other search problem; we could, for instance, just use the AlphaZero algorithm to train a value network. In addition, the headline results are generated by hand-engineering features like x-y coordinates and room number, a technique that has been eschewed by most other attempts. When you take those features away, their agent’s total reward on Pitfall falls back to 0.

Read more: Quick Opinions on Go-Explore

Prioritizing Starting States for Reinforcement Learning (Arash Tavakoli, Vitaly Levdik et al)

Reinforcement learning

Learning Actionable Representations with Goal-Conditioned Policies (Dibya Ghosh)

Unsupervised Control Through Non-Parametric Discriminative Rewards (David Warde-Farley)

Hierarchical RL

Hierarchical visuomotor control of humanoids (Josh Merel, Arun Ahuja et al)