The Alignment Newsletter #1: 04/09/18


Specification gaming examples in AI (Victoria Krakovna): A list of examples of specification gaming, where an algorithm finds a way to literally satisfy the given specification that does not match the designer's intent.

Should you read it? There were several examples I hadn't heard of before, which were pretty entertaining. Also, if you have any examples that aren't already listed, it would be great to send them via the form so that we can have a canonical list of specification gaming examples.

My take on agent foundations: formalizing metaphilosophical competence (Alex Zhu): Argues that the point of Agent Foundations is to create conceptual clarity for fuzzy concepts that we can't formalize yet (such as logical uncertainty). We can then verify whether our ML algorithms have these desirable properties. It is decidedly not a goal to build a friendly AI using the modules that Agent Foundations develops.

Should you read it? I don't know much about MIRI and Agent Foundations, but this made sense to me and clarified things for me.

Adversarial Attacks and Defences Competition (Alexey Kurakin et al): This is a report on a competition held at NIPS 2017 for the best adversarial attacks and defences. It includes a summary of the field and then shows the results from the competition.

Should you read it? I'm not very familiar with the literature on adversarial examples, so I found this very useful as an overview of the field, especially since it talks about the advantages and disadvantages of different methods, which are hard to find by reading individual papers. The actual competition results are also quite interesting: they find that the best attacks and defences are both quite successful on average, but have very bad worst-case performance (that is, the best defence is still very weak against at least one attack, and the best attack fails against at least one defence). Overall, this paints a bleak picture for defence, at least if the attacker has access to enough compute to actually try out different attack methods, and has a way of verifying whether the attacks succeed.
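To make the flavor of these attacks concrete, here is a minimal sketch of the fast gradient sign method (FGSM), one of the standard attacks surveyed in the report. The model, weights, and inputs below are hypothetical toy values; against a real network you would backpropagate through the network rather than use the closed-form logistic-regression gradient.

```python
import numpy as np

def fgsm_attack(x, y, w, b, eps):
    """Fast gradient sign method against a logistic-regression classifier
    p(y=1|x) = sigmoid(w.x + b): move each input dimension by eps in the
    direction that increases the cross-entropy loss."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # model's predicted probability
    grad_x = (p - y) * w                    # d(cross-entropy)/dx, closed form
    return x + eps * np.sign(grad_x)        # one signed gradient step

# Hypothetical toy model and input (illustrative numbers only).
w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.3, 0.2, -0.1])              # clean input with true label 1
x_adv = fgsm_attack(x, y=1.0, w=w, b=b, eps=0.1)
```

Even this one-step attack lowers the model's confidence in the true label, which is why the report spends so much time on stronger iterative variants and on defences against them.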

Technical AI alignment


Specification gaming examples in AI (Victoria Krakovna): Summarized in the highlights!

Metaphilosophical competence can't be disentangled from alignment (Alex Zhu): Would you be comfortable taking a single human, and making them a quadrillion times more powerful?

Should you read it? I am curious to see people's answers to this; I think it might be a good question for revealing major differences in worldviews between optimistic and pessimistic safety researchers.

Reframing misaligned AGI's: well-intentioned non-neurotypical assistants (Alex Zhu): Another way to think about problems from AGI is to imagine the AI as a well-intentioned but non-neurotypical friend, who learned all about humans from Wikipedia, and who has access to immense resources. You would worry a lot about principal-agent problems in such a scenario.

Should you read it? I like this framing. I'm not sure if it is actually a good model for act-based agents, but it's another way to think about what problems could arise from an AI system that is superintelligent in some domains and subhuman in others.

Read more: Act-based agents

Superintelligent messiahs are corrigible and probably misaligned (Alex Zhu)

Technical agendas and prioritization

My take on agent foundations: formalizing metaphilosophical competence (Alex Zhu): Summarized in the highlights!

Agent foundations

2018 research plans and predictions (Rob Bensinger): Scott and Nate from MIRI score their predictions for research output in 2017 and make predictions for research output in 2018.

Should you read it? I don't know enough about MIRI to have any idea what the predictions mean, but I'd still recommend reading it if you're somewhat familiar with MIRI's technical agenda, to get a bird's-eye view of what they have been focusing on for the last year.

Prerequisites: A basic understanding of MIRI's technical agenda (e.g. what they mean by naturalized agents, decision theory, Vingean reflection, and so on).

Musings on Exploration (Alex Appel): Decision theories require some exploration in order to prevent the problem of spurious counterfactuals, where you condition on a zero-probability event. However, there are problems with exploration too, such as unsafe exploration (e.g. launching a nuclear arsenal on an exploration step), and a sufficiently strong agent seems to have an incentive to self-modify to remove the exploration, because exploration usually leads to suboptimal outcomes for the agent.
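As a cartoon of this tradeoff, here is what epsilon-exploration looks like. All names below are hypothetical and the sketch only illustrates why exploration makes counterfactuals well-defined at the cost of occasionally taking actions the agent knows are worse.

```python
import random

def epsilon_explore(preferred_action, actions, eps=0.05, rng=random.random):
    """With probability eps take a uniformly random action, otherwise take
    the preferred one. Because every action now has probability > 0,
    "what would happen if I took action a?" conditions on a
    positive-probability event rather than a spurious counterfactual.
    The downside, as the post notes: the random step is usually
    suboptimal, and could be unsafe."""
    if rng() < eps:
        return random.choice(actions)  # rare uniform exploration step
    return preferred_action            # the action the agent actually prefers
```

A strong agent that understands this mechanism would prefer to delete the `rng() < eps` branch, which is the self-modification incentive the post worries about.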

Should you read it? I liked the linked post that explains why conditioning on low-probability actions is not the same thing as a counterfactual, but I'm not knowledgeable enough to understand what's going on in this post, so I can't really say whether or not you should read it.

Quantilal control for finite MDPs (Vadim Kosoy)

Miscellaneous (Alignment)

Papers from AI and Society: Ethics, Safety and Trustworthiness in Intelligent Agents

Guide Me: Interacting with Deep Networks (Christian Rupprecht, Iro Laina et al)

Near-term concerns

Adversarial examples

Adversarial Attacks and Defences Competition (Alexey Kurakin et al): Summarized in the highlights!


Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks (Ali Shafahi, W. Ronny Huang et al): Demonstrates a data poisoning attack in which the adversary gets to choose a poison input to add to the training set, but does not get to choose its label. The goal is to misclassify a single test instance as a specific base class. They achieve this by creating a poison input that looks like the base class in pixel space but looks like the test instance in feature space (i.e. the activations in the penultimate layer). The poison input will be labeled by humans as the base class, and then when the network is retrained with the original dataset and the new poisoned input(s), it will classify the poison input as the base class, and with it the test instance as well (since they have very similar features).
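The core of the method can be written as a single "feature collision" objective. The sketch below is a simplification of the paper's procedure: it substitutes a linear map for the network's penultimate-layer features (the paper optimizes through the real network, with a forward-backward splitting scheme), and all numbers are illustrative.

```python
import numpy as np

def craft_poison(W, base, target, beta=0.25, lr=0.01, steps=500):
    """Gradient descent on the feature-collision objective
        ||W x - W target||^2 + beta * ||x - base||^2,
    i.e. look like the target test instance in feature space while staying
    close to the base-class image in pixel space. W stands in for the
    network's penultimate-layer feature map (linear here so the sketch
    stays self-contained)."""
    x = base.copy()
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ (x - target)) + 2.0 * beta * (x - base)
        x = x - lr * grad
    return x

# Hypothetical 2-pixel "images" and 2-d features, illustrative only.
W = np.array([[1.0, 0.5], [0.0, 1.0]])
base, target = np.array([0.0, 0.0]), np.array([1.0, 1.0])
poison = craft_poison(W, base, target)
```

The beta term controls the tradeoff: larger beta keeps the poison visually closer to the base image (so humans label it as the base class), at the cost of a weaker feature collision with the target.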

Should you read it? I was pleasantly surprised at how understandable the paper was, and they do a good job of looking at exactly what their method is doing and how it accomplishes the attack in different ways under different settings.

Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning (Matthew Jagielski et al)

AI strategy and policy

France's AI strategy: See Import AI's summary.

Initial Reference Architecture of an Intelligent Autonomous Agent for Cyber Defense (Alexander Kott et al): See Import AI's summary.

AI capabilities

Reinforcement learning

Retro Contest (Christopher Hesse et al): OpenAI has released Gym Retro, providing an interface to work with video games from the Sega Genesis, which are more complex than the ones from Atari. They want to use these environments to test transfer learning in particular, where the agent may be pretrained on initial levels for as long as desired, and then must learn how to complete a new test level with only 1 million timesteps (~18 hours) of gameplay. (Humans do well with 2 hours of pretraining and 1 hour of play on the test level.)

Should you read it? If you want to keep track of progress in deep RL, probably: this seems quite likely to become the new set of benchmarks that researchers work on. There's also another example of specification gaming in the post.

Learning to navigate in cities without a map (Piotr Mirowski et al)

Deep learning

Universal Planning Networks (Aravind Srinivas et al): This is an architecture that has a differentiable planning module, that is, a neural network that takes in (encodings of) states or observations and produces actions. You can use this in conjunction with e.g. expert demonstrations (as in imitation learning) in order to learn features that are optimized for the purpose of planning, focusing only on the details relevant to the task, unlike an autoencoder, which must reconstruct the entire image, including irrelevant details.
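A minimal sketch of the plan-by-gradient-descent idea (not the paper's actual architecture): replace the learned encoder and dynamics with a known linear model, and optimize an action sequence by backpropagating the goal-distance loss through the rollout. All dynamics and numbers here are hypothetical.

```python
import numpy as np

def plan_by_gradient_descent(A, B, x0, x_goal, T=10, lr=0.01, steps=200):
    """Inner loop of a gradient-based planner in the spirit of Universal
    Planning Networks: treat the action sequence U as parameters and run
    gradient descent on the terminal cost ||x_T - x_goal||^2. UPN would
    use a learned encoder and dynamics network; here the dynamics are a
    known linear model x' = A x + B u, so the gradient is exact."""
    U = np.zeros((T, B.shape[1]))
    for _ in range(steps):
        # forward rollout under the current plan
        x = x0
        for t in range(T):
            x = A @ x + B @ U[t]
        # backward pass: grad wrt U[t] is B^T (A^T)^(T-1-t) grad_xT
        g = 2.0 * (x - x_goal)
        for t in reversed(range(T)):
            U[t] = U[t] - lr * (B.T @ g)
            g = A.T @ g
    return U

# Hypothetical 2-d "integrator" dynamics, purely for illustration.
A, B = np.eye(2), np.eye(2)
plan = plan_by_gradient_descent(A, B, np.zeros(2), np.ones(2))
```

In UPN this inner loop is itself differentiated through during training, which is what pushes the learned features to contain exactly the information planning needs.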

Should you read it? It's a good example of the push towards learning more and more complex algorithms using neural nets (in this case, planning). From a safety perspective, differentiable planning networks may be useful for modeling humans.
