Alignment Newsletter #22


Highlights

AI Governance: A Research Agenda (Allan Dafoe): A comprehensive document about the research agenda at the Governance of AI Program. It is really long and covers a lot of ground, so I’m not going to summarize it, but I highly recommend it, even if you intend to focus primarily on technical work.

Tech­ni­cal AI alignment

Agent foundations

Agents and Devices: A Relative Definition of Agency (Laurent Orseau et al): This paper considers the problem of modeling the behavior of another system either as an agent (trying to achieve some goal) or as a device (that reacts to its environment without any clear goal). They use Bayesian IRL to model behavior as coming from an agent optimizing a reward function, and design their own probability model to model the behavior as coming from a device. They then use Bayes’ rule to decide whether the behavior is better modeled as an agent or as a device. Since they have a uniform prior over agents and devices, this ends up choosing the one that better fits the data, as measured by log likelihood.
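
As a rough illustration of that final comparison step (not the paper's implementation), the sketch below turns two log marginal likelihoods, one per model class, into a posterior probability for the agent class. The function name `posterior_agent` and the example log-likelihood values are assumptions invented for this example. With a uniform prior, the class with the higher log likelihood always ends up with posterior probability above 0.5.

```python
import math


def posterior_agent(loglik_agent, loglik_device, prior_agent=0.5):
    """Posterior probability of the agent class via Bayes' rule.

    loglik_agent / loglik_device are log marginal likelihoods of the
    observed behavior under each class (hypothetical inputs here).
    """
    log_joint_agent = math.log(prior_agent) + loglik_agent
    log_joint_device = math.log(1.0 - prior_agent) + loglik_device
    # Subtract the max before exponentiating to avoid underflow.
    m = max(log_joint_agent, log_joint_device)
    w_agent = math.exp(log_joint_agent - m)
    w_device = math.exp(log_joint_device - m)
    return w_agent / (w_agent + w_device)


# Made-up numbers: the agent class fits the data better by 2 nats.
print(posterior_agent(-10.0, -12.0))
# Equal fit and a uniform prior leave the two classes tied.
print(posterior_agent(-5.0, -5.0))  # 0.5
```

A non-uniform `prior_agent` would shift the decision threshold, which is why the uniform prior reduces the choice to a pure comparison of fit.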

In their toy grid­world, agents are nav­i­gat­ing to­wards par­tic­u­lar lo­ca­tions in the grid­world, whereas de­vices are re­act­ing to their lo­cal ob­ser­va­tion (the type of cell in the grid­world that they are cur­rently fac­ing, as well as the pre­vi­ous ac­tion they took). They cre­ate a few en­vi­ron­ments by hand which demon­strate that their method in­fers the in­tu­itive an­swer given the be­hav­ior.

My opinion: In their ex­per­i­ments, they have two differ­ent model classes with very differ­ent in­duc­tive bi­ases, and their method cor­rectly switches be­tween the two classes de­pend­ing on which in­duc­tive bias works bet­ter. One of these classes is the max­i­miza­tion of some re­ward func­tion, and so we call that the agent class. How­ever, they also talk about us­ing the Solomonoff prior for de­vices—in that case, even if we have some­thing we would nor­mally call an agent, if it is even slightly sub­op­ti­mal, then with enough data the de­vice ex­pla­na­tion will win out.
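
One way to see why the device explanation would eventually win, as a rough sketch rather than the paper's own argument: let x_1, ..., x_n be the observed actions, let "dev" be a single device hypothesis that predicts the actual (slightly suboptimal) behavior, and let "ag" be the best agent hypothesis. Assume, purely for illustration, that "dev" has an average per-step log-likelihood advantage of ε > 0 over "ag".

```latex
\log \frac{P(\text{dev} \mid x_{1:n})}{P(\text{ag} \mid x_{1:n})}
  = \log \frac{P(\text{dev})}{P(\text{ag})}
  + \sum_{t=1}^{n} \log \frac{P(x_t \mid x_{<t}, \text{dev})}{P(x_t \mid x_{<t}, \text{ag})}
  \approx \log \frac{P(\text{dev})}{P(\text{ag})} + n\,\varepsilon
```

The prior term is a fixed constant, however small the prior on "dev" is, while the second term grows linearly in n. Under these assumptions the log odds become positive once n is large enough.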

I’m not entirely sure why they are studying this problem in particular. One reason is explained in the next post, and I’ll write more about it in that section.

Bottle Caps Aren’t Optimisers (Daniel Filan): The previous paper detects optimizers by studying their behavior. However, if the goal is to detect an optimizer before deployment, we need to determine whether an algorithm is performing optimization by studying its source code, without running it. One definition that people have come up with is that an optimizer is something whose presence causes the objective function to attain higher values than it otherwise would have. However, the author thinks that this definition is insufficient. For example, it would allow us to say that a bottle cap is an optimizer for keeping water inside the bottle. Perhaps in this case we can say that there are simpler descriptions of bottle caps, so those should take precedence. But what about a liver? We could say that a liver is optimizing for its owner’s bank balance, since in its absence the bank balance is not going to increase.

My opinion: Here, we want a definition of optimization because we’re worried about an AI being deployed, optimizing for some metric in the environment, and then doing something unexpected that we don’t like but that nonetheless does increase the metric (falling prey to Goodhart’s law). It seems better to me to talk about “optimizer” and “agent” as models for predicting behavior, not as inherent properties of the thing producing the behavior. Under that interpretation, we want to figure out whether the agent model with a particular utility function is a good model for an AI system, by looking at its internals (without running it). It seems particularly important to be able to use this model to predict the behavior in novel situations. Perhaps that’s what is needed to make the definition of optimizer avoid the counterexamples in this post. (A bottle cap definitely isn’t going to keep water in containers if it is simply lying on a table somewhere.)

Using expected utility for Good(hart) (Stuart Armstrong): If we incorporate all of the uncertainty we have about human values into the utility function, then it seems possible to design an expected utility maximizer that doesn’t fall prey to Goodhart’s law. The post shows a simple example where there are many variables that may be of interest to humans, but we’re not sure which ones. In this case, by incorporating this uncertainty into our proxy utility function, we can design an expected utility maximizer whose behavior is sensibly conservative.
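
To make the idea concrete, here is a toy sketch of my own (not the example from the post). An optimizer splits a fixed budget across three variables, exactly one of which humans care about, and utility has diminishing returns in that variable. The budget size, the square-root utility, and all names below are assumptions chosen for illustration.

```python
import itertools
import math

N_VARIABLES = 3   # hypothetical number of candidate variables
BUDGET_UNITS = 6  # hypothetical effort budget, in whole units


def expected_utility(allocation, probs):
    # probs[i] is the credence that variable i is the one humans care about.
    # sqrt models diminishing returns; this is an assumption of the sketch.
    return sum(p * math.sqrt(x) for p, x in zip(probs, allocation))


def best_allocation(probs):
    # Brute-force search over every way to split the budget.
    candidates = (
        alloc
        for alloc in itertools.product(range(BUDGET_UNITS + 1), repeat=N_VARIABLES)
        if sum(alloc) == BUDGET_UNITS
    )
    return max(candidates, key=lambda alloc: expected_utility(alloc, probs))


# A proxy that is certain variable 0 is what matters.
print(best_allocation([1.0, 0.0, 0.0]))  # (6, 0, 0)
# Uniform uncertainty over which variable matters.
print(best_allocation([1 / 3] * 3))      # (2, 2, 2)
```

The overconfident proxy pours everything into one variable, while the maximizer that carries the uncertainty hedges across all three. Note that the hedging depends on the diminishing-returns assumption: with a linear utility the uncertain maximizer would be indifferent between allocations.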

My opinion: On the one hand, I’m sym­pa­thetic to this view—for ex­am­ple, I see risk aver­sion as a heuris­tic lead­ing to good ex­pected util­ity max­i­miza­tion for bounded rea­son­ers on large timescales. On the other hand, an EU max­i­mizer still seems hard to al­ign, be­cause what­ever util­ity func­tion it gets, or dis­tri­bu­tion over util­ity func­tions, it will act as though that in­put is definitely true, which means that any­thing we fail to model will never make it into the util­ity func­tion. If you could have some sort of “un­re­solv­able” un­cer­tainty, some rea­son­ing (similar to the prob­lem of in­duc­tion) sug­gest­ing that you can never fully trust your own thoughts to be perfectly cor­rect, that would make me more op­ti­mistic about an EU max­i­miza­tion based ap­proach, but I don’t think it can be done by just chang­ing the util­ity func­tion, or by adding a dis­tri­bu­tion over them.

Cor­rigi­bil­ity doesn’t always have a good ac­tion to take (Stu­art Arm­strong): Stu­art has pre­vi­ously ar­gued that an AI could be put in situ­a­tions where no mat­ter what it does, it would af­fect the hu­man’s val­ues. In this short post, he notes that if you then say that it is pos­si­ble to have situ­a­tions where the AI can­not act cor­rigibly, then other prob­lems arise, such as how you can cre­ate a su­per­in­tel­li­gent cor­rigible AI that does any­thing at all (since any ac­tion that it takes would likely af­fect our val­ues some­how).

Computational complexity of RL with traps (Vadim Kosoy): A post asking about complexity-theoretic results around RL, with both (unknown) deterministic and stochastic dynamics.

Co­op­er­a­tive Or­a­cles (Diffrac­tor)


The What, the Why, and the How of Ar­tifi­cial Ex­pla­na­tions in Au­to­mated De­ci­sion-Mak­ing (Tarek R. Be­sold et al)

Mis­cel­la­neous (Align­ment)

Do what we mean vs. do what we say (Rohin Shah): I wrote a post proposing that we define a “do what we mean” system to be one in which the thing being optimized is latent (in the sense that it is not explicitly specified, not that it has a probability distribution over it). Conversely, a “do what we say” system explicitly optimizes something provided as an input. A lot of AI safety arguments can be understood as saying that a pure “do what we say” AI will lead to catastrophic outcomes. However, this doesn’t mean that a “do what we mean” system is the way to go: it could be that we want a “do what we mean” core, along with a “do what we say” subsystem that makes sure that the AI always listens to e.g. shutdown commands.

VOI is Only Non­nega­tive When In­for­ma­tion is Un­cor­re­lated With Fu­ture Ac­tion (Diffrac­tor): Nor­mally, the value of get­ting more in­for­ma­tion (VOI) is always non­nega­tive (for a ra­tio­nal agent), be­cause you can always take the same ac­tion you would have if you didn’t have the in­for­ma­tion, so your de­ci­sion will only im­prove. How­ever, if the in­for­ma­tion would cause you to have a differ­ent set of ac­tions available, as in many de­ci­sion the­ory ex­am­ples, then this proof no longer ap­plies, since you may no longer be able to take the ac­tion you would have oth­er­wise taken. As a re­sult, in­for­ma­tion can have nega­tive value.
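
A small numeric sketch of both cases, using a made-up two-state, two-action decision problem (the payoffs, state names, and function names are all assumptions for illustration, not taken from the post):

```python
# Hypothetical prior over states and payoff table PAYOFF[action][state].
PRIOR = {"s0": 0.5, "s1": 0.5}
PAYOFF = {
    "safe": {"s0": 1.0, "s1": 1.0},
    "risky": {"s0": 3.0, "s1": -2.0},
}


def value_without_info(actions):
    # Best expected payoff when acting on the prior alone.
    return max(sum(PRIOR[s] * PAYOFF[a][s] for s in PRIOR) for a in actions)


def value_with_info(actions_given_state):
    # Expected payoff when the state is revealed before acting.
    # actions_given_state[s] lists the actions available after learning s.
    return sum(
        PRIOR[s] * max(PAYOFF[a][s] for a in actions_given_state[s])
        for s in PRIOR
    )


def value_of_information(actions, actions_given_state):
    return value_with_info(actions_given_state) - value_without_info(actions)


ALL = ["safe", "risky"]

# Standard case: learning the state leaves every action available.
print(value_of_information(ALL, {"s0": ALL, "s1": ALL}))        # 1.0

# Learning s1 removes the "safe" action, so the usual proof breaks.
print(value_of_information(ALL, {"s0": ALL, "s1": ["risky"]}))  # -0.5
```

In the first call the informed agent can always fall back on what it would have done anyway, so the information cannot hurt. In the second call that fallback is gone in state `s1`, and the information has negative value.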

AI strat­egy and policy

AI Gover­nance: A Re­search Agenda (Allan Dafoe): Sum­ma­rized in the high­lights!

Su­per­in­tel­li­gence Skep­ti­cism as a Poli­ti­cal Tool (Seth Baum)

Other progress in AI

Re­in­force­ment learning

In­tro­duc­ing a New Frame­work for Flex­ible and Re­pro­ducible Re­in­force­ment Learn­ing Re­search (Pablo Sa­muel Cas­tro and Marc G. Bel­le­mare): Re­searchers at Google have re­leased Dopamine, a small frame­work for RL re­search on Atari games, with four built-in agents—DQN, C51, a sim­plified ver­sion of Rain­bow, and the re­cent Im­plicit Quan­tile Net­work. There’s a par­tic­u­lar em­pha­sis on re­pro­ducibil­ity, by pro­vid­ing logs from train­ing runs, train­ing data, etc.

Dex­ter­ous Ma­nipu­la­tion with Re­in­force­ment Learn­ing: Effi­cient, Gen­eral, and Low-Cost (Henry Zhu et al)

Deep learning

Why Self-At­ten­tion? A Tar­geted Eval­u­a­tion of Neu­ral Ma­chine Trans­la­tion Ar­chi­tec­tures (Gongbo Tang et al)

Trans­fer Learn­ing for Es­ti­mat­ing Causal Effects us­ing Neu­ral Net­works (Sören R. Künzel, Bradly C. Stadie et al)

Un­su­per­vised learning

Un­su­per­vised Learn­ing of Syn­tac­tic Struc­ture with In­vert­ible Neu­ral Pro­jec­tions (Junx­ian He et al)


LIFT: Re­in­force­ment Learn­ing in Com­puter Sys­tems by Learn­ing From De­mon­stra­tions (Michael Schaarschmidt et al)


80,000 Hours Job Board: AI/​ML safety re­search: 80,000 Hours re­cently up­dated their job board, in­clud­ing the sec­tion on tech­ni­cal safety re­search. The AI strat­egy and gov­er­nance sec­tion is prob­a­bly also of in­ter­est.

BERI/​CHAI ML en­g­ineer: I want to high­light this role in par­tic­u­lar—I ex­pect this to be a po­si­tion where you can not only have a large im­pact, but also learn more about tech­ni­cal re­search, putting you in a bet­ter po­si­tion to do re­search in the fu­ture.

HLAI 2018 Field Report (G Gordon Worley III): A report on the human-level AI multiconference from the perspective of a safety researcher who attended. The reflections are more about the state of the field than about technical insights gained. For example, he got the impression that most researchers working on AGI hadn’t thought deeply about safety. Based on this, he has two recommendations: first, that we normalize thinking about AI safety, and second, that we establish a “sink” for dangerous AI research.

My opinion: I definitely agree that we need to nor­mal­ize think­ing about AI safety, and I think that’s been hap­pen­ing. In fact, I think of that as one of the ma­jor benefits of writ­ing this newslet­ter, even though I started it with AI safety re­searchers in mind (who still re­main the au­di­ence I write for, if not the au­di­ence I ac­tu­ally have). I’m less con­vinced that we should have a pro­cess for dan­ger­ous AI re­search. What counts as dan­ger­ous? Cer­tainly this makes sense for AI re­search that can be dan­ger­ous in the short term, such as re­search that has mil­i­tary or surveillance ap­pli­ca­tions, but what would be dan­ger­ous from a long-term per­spec­tive? It shouldn’t just be re­search that differ­en­tially benefits gen­eral AI over long-term safety, since that’s al­most all AI re­search. And even though on the cur­rent mar­gin I would want re­search to differ­en­tially ad­vance safety, it feels wrong to call other re­search dan­ger­ous, es­pe­cially given its enor­mous po­ten­tial for good.

State of Cal­ifor­nia En­dorses Asilo­mar AI Prin­ci­ples (FLI Team)
