Alignment Newsletter #45


Find all Align­ment Newslet­ter re­sources here. In par­tic­u­lar, you can sign up, or look through this spread­sheet of all sum­maries that have ever been in the newslet­ter.


Learn­ing Prefer­ences by Look­ing at the World (Ro­hin Shah and Dmitrii Krashen­in­nikov): The key idea with this pro­ject that I worked on is that the state of the world is already op­ti­mized for our prefer­ences, and so sim­ply by look­ing at the world we can in­fer these prefer­ences. Con­sider the case where there is a vase stand­ing up­right on the table. This is an un­sta­ble equil­ibrium—it’s very easy to knock over the vase so it is ly­ing side­ways, or is com­pletely bro­ken. The fact that this hasn’t hap­pened yet sug­gests that we care about vases be­ing up­right and in­tact; oth­er­wise at some point we prob­a­bly would have let it fall.

Since we have op­ti­mized the world for our prefer­ences, the nat­u­ral ap­proach is to model this pro­cess, and then in­vert it to get the prefer­ences. You could imag­ine that we could con­sider all pos­si­ble re­ward func­tions, and put prob­a­bil­ity mass on them in pro­por­tion to how likely they make the cur­rent world state if a hu­man op­ti­mized them. Ba­si­cally, we are simu­lat­ing the past in or­der to figure out what must have hap­pened and why. With the vase ex­am­ple, we would no­tice that in any re­ward func­tion where hu­mans wanted to break vases, or were in­differ­ent to bro­ken vases, we would ex­pect the cur­rent state to con­tain bro­ken vases. Since we don’t ob­serve that, it must be the case that we care about keep­ing vases in­tact.

Our al­gorithm, Re­ward Learn­ing by Si­mu­lat­ing the Past (RLSP), takes this in­tu­ition and ap­plies it in the frame­work of Max­i­mum Causal En­tropy IRL (AN #12), where you as­sume that the hu­man was act­ing over T timesteps to pro­duce the state that you ob­serve. We then show a few grid­world en­vi­ron­ments in which ap­ply­ing RLSP can fix a mis­speci­fied re­ward func­tion.
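To make the "simulate the past, then invert" idea concrete, here is a minimal sketch in Python. It is not the RLSP algorithm from the paper: it simply enumerates three hypothetical candidate reward functions in a made-up two-state vase world, assumes a Boltzmann-rational human acting for T timesteps, and weights each reward by how likely it makes the observed state. The dynamics, the reward values, the uniform prior, and all function names are assumptions chosen for illustration.

```python
import math

STATES = ["intact", "broken"]
ACTIONS = ["careful", "careless"]

def transition(state, action):
    """Toy dynamics (assumed): a broken vase stays broken; acting
    carelessly around an intact vase breaks it half of the time."""
    if state == "broken":
        return {"broken": 1.0}
    if action == "careful":
        return {"intact": 1.0}
    return {"intact": 0.5, "broken": 0.5}

# Hypothetical candidate rewards, expressed as reward per state reached.
REWARDS = {
    "likes intact vases": {"intact": 1.0, "broken": 0.0},
    "indifferent": {"intact": 0.0, "broken": 0.0},
    "likes broken vases": {"intact": 0.0, "broken": 1.0},
}

def soft_policies(reward, horizon):
    """Finite-horizon soft value iteration. Returns one Boltzmann policy
    per timestep, ordered from the first timestep to the last."""
    value = {s: 0.0 for s in STATES}  # value after the final timestep
    policies = []
    for _ in range(horizon):
        q = {
            s: {
                a: sum(p * (reward[s2] + value[s2])
                       for s2, p in transition(s, a).items())
                for a in ACTIONS
            }
            for s in STATES
        }
        value = {s: math.log(sum(math.exp(q[s][a]) for a in ACTIONS))
                 for s in STATES}
        policies.append({s: {a: math.exp(q[s][a] - value[s]) for a in ACTIONS}
                         for s in STATES})
    policies.reverse()  # built backwards in time, so flip the order
    return policies

def state_likelihood(reward, observed, horizon, start="intact"):
    """Probability of ending in `observed` after `horizon` steps."""
    dist = {s: 1.0 if s == start else 0.0 for s in STATES}
    for policy in soft_policies(reward, horizon):
        nxt = {s: 0.0 for s in STATES}
        for s, ps in dist.items():
            for a in ACTIONS:
                for s2, p in transition(s, a).items():
                    nxt[s2] += ps * policy[s][a] * p
        dist = nxt
    return dist[observed]

def posterior(observed, horizon):
    """Posterior over candidate rewards under a uniform prior."""
    likes = {name: state_likelihood(r, observed, horizon)
             for name, r in REWARDS.items()}
    total = sum(likes.values())
    return {name: like / total for name, like in likes.items()}

if __name__ == "__main__":
    for name, prob in posterior("intact", horizon=10).items():
        print(f"{name}: {prob:.3f}")
```

In this toy setup, observing an intact vase after ten timesteps shifts probability mass toward the reward that values intact vases, matching the intuition in the summary above.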

Ro­hin’s opinion: In ad­di­tion to this blog post and the pa­per, I also wrote a post on the Align­ment Fo­rum ex­press­ing opinions about the work. There are too many dis­parate opinions to put in here, so I’d recom­mend read­ing the post it­self. I guess one thing I’ll men­tion is that to in­fer prefer­ences with a sin­gle state, you definitely need a good dy­nam­ics model, and a good set of fea­tures. While this may seem difficult to get, it’s worth not­ing that dy­nam­ics are em­piri­cal facts about the world, and fea­tures might be, and there is already lots of work on learn­ing both dy­nam­ics and fea­tures.

Tech­ni­cal AI alignment

Iter­ated am­plifi­ca­tion sequence

Se­cu­rity am­plifi­ca­tion (Paul Chris­ti­ano): If we imag­ine hu­mans as rea­son­ers over nat­u­ral lan­guage, there are prob­a­bly some es­o­teric sen­tences that could cause “failure”. For ex­am­ple, maybe there are un­rea­son­ably con­vinc­ing ar­gu­ments that cause the hu­man to be­lieve some­thing, when they shouldn’t have been con­vinced by the ar­gu­ment. Maybe they are tricked or threat­ened in a way that “shouldn’t” have hap­pened. The goal with se­cu­rity am­plifi­ca­tion is to make these sorts of sen­tences difficult to find, so that we will not come across them in prac­tice. As with Reli­a­bil­ity am­plifi­ca­tion (AN #44), we are try­ing to am­plify a fast agent A into a slow agent A* that is “more se­cure”, mean­ing that it is mul­ti­plica­tively harder to find an in­put that causes a catas­trophic failure.

You might ex­pect that ca­pa­bil­ity am­plifi­ca­tion (AN #42) would also im­prove se­cu­rity, since the more ca­pa­ble agent would be able to no­tice failure modes and re­move them. How­ever, this would likely take far too long.

In­stead, we can hope to achieve se­cu­rity am­plifi­ca­tion by mak­ing rea­son­ing ab­stract and ex­plicit, with the hope that when rea­son­ing is ex­plicit it be­comes harder to trig­ger the un­der­ly­ing failure mode, since you have to get your at­tack “through” the ab­stract rea­son­ing. I be­lieve a fu­ture post will talk about this more, so I’ll leave the de­tails till then. Another op­tion would be for the agent to act stochas­ti­cally; for ex­am­ple, when it needs to gen­er­ate a sub­ques­tion, it gen­er­ates many differ­ent word­ings of the sub­ques­tion and chooses one ran­domly. If only one of the word­ings can trig­ger the failure, then this re­duces the failure prob­a­bil­ity.
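The arithmetic behind the stochastic option can be sketched as follows. This is only an illustration under strong assumptions: the agent picks uniformly among a fixed number of wordings, and a known number of them trigger the failure. The function names and numbers are hypothetical, not taken from the post.

```python
import random

def failure_probability(num_wordings, num_triggering):
    """Chance that a uniformly random wording is one that triggers the failure."""
    return num_triggering / num_wordings

def simulate(num_wordings, num_triggering, trials, seed=0):
    """Monte Carlo check: treat wordings 0..num_triggering-1 as the bad ones."""
    rng = random.Random(seed)
    failures = sum(rng.randrange(num_wordings) < num_triggering
                   for _ in range(trials))
    return failures / trials

if __name__ == "__main__":
    print(failure_probability(10, 1))
    print(simulate(10, 1, trials=100_000))
```

With ten candidate wordings of which one is bad, an attacker who could previously trigger the failure every time would, under these assumptions, succeed on only about one attempt in ten.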

Rohin’s opinion: This is the counterpoint to Reliability amplification (AN #44) from last week, and the same confusions I had last week still apply, so I’m going to refrain from an opinion.


Con­struct­ing Good­hart (john­swent­worth): This post makes the point that Good­hart’s Law is so com­mon in prac­tice be­cause if there are sev­eral things that we care about, then we are prob­a­bly at or close to a Pareto-op­ti­mal point with re­spect to those things, and so choos­ing any one of them as a proxy met­ric to op­ti­mize will cause the other things to be­come worse, lead­ing to Good­hart effects.

Ro­hin’s opinion: This is an im­por­tant point about Good­hart’s Law. If you take some “ran­dom” or un­op­ti­mized en­vi­ron­ment, and then try to op­ti­mize some proxy for what you care about, it will prob­a­bly work quite well. It’s only when the en­vi­ron­ment is already op­ti­mized that Good­hart effects are par­tic­u­larly bad.
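A toy numerical sketch of this point, under assumptions of my own choosing (not from the post): two goods `x` and `y` share a fixed budget, the "true" utility is `sqrt(x) + sqrt(y)`, and the proxy being optimized is `x` alone. Pushing on the proxy helps when there is slack in the budget, and hurts once we are already on the Pareto frontier.

```python
import math

def true_utility(x, y):
    """Assumed 'true' objective: we care about both goods, with diminishing returns."""
    return math.sqrt(x) + math.sqrt(y)

def optimize_proxy(x, y, budget=1.0, step=0.1):
    """Increase the proxy x by `step`. Unused budget is spent first;
    once there is none left, the increase has to come out of y."""
    slack = max(0.0, budget - (x + y))
    from_slack = min(step, slack)
    from_y = min(step - from_slack, y)
    return x + from_slack + from_y, y - from_y

if __name__ == "__main__":
    for label, point in [("unoptimized", (0.1, 0.1)), ("Pareto-optimal", (0.5, 0.5))]:
        before = true_utility(*point)
        after = true_utility(*optimize_proxy(*point))
        print(f"{label}: true utility {before:.3f} -> {after:.3f}")
```

In the unoptimized starting point the proxy and the true utility rise together; at the point on the budget line, the same proxy step lowers the true utility.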

Im­pos­si­bil­ity and Uncer­tainty The­o­rems in AI Value Align­ment (or why your AGI should not have a util­ity func­tion) (Peter Eck­er­sley) (sum­ma­rized by Richard): This pa­per dis­cusses some im­pos­si­bil­ity the­o­rems re­lated to the Repug­nant con­clu­sion in pop­u­la­tion ethics (i.e. the­o­rems show­ing that no moral the­ory si­mul­ta­neously satis­fies cer­tain sets of in­tu­itively de­sir­able prop­er­ties). Peter ar­gues that in the con­text of AI it’s best to treat these the­o­rems as un­cer­tainty re­sults, ei­ther by al­low­ing in­com­men­su­rate out­comes or by al­low­ing prob­a­bil­is­tic moral judge­ments. He hy­poth­e­sises that “the emer­gence of in­stru­men­tal sub­goals is deeply con­nected to moral cer­tainty”, and so im­ple­ment­ing un­cer­tain ob­jec­tive func­tions is a path to mak­ing AI safer.

Richard’s opinion: The more gen­eral ar­gu­ment un­der­ly­ing this post is that al­ign­ing AGI will be hard partly be­cause ethics is hard (as dis­cussed here). I agree that us­ing un­cer­tain ob­jec­tive func­tions might help with this prob­lem. How­ever, I’m not con­vinced that it’s use­ful to frame this is­sue in terms of im­pos­si­bil­ity the­o­rems and nar­row AI, and would like to see these ideas laid out in a philo­soph­i­cally clearer way.

Iter­ated amplification

HCH is not just Mechanical Turk (William Saunders): In Humans Consulting HCH (HCH) (AN #34) a human is asked a question and is supposed to return an answer. The human can ask subquestions, which are delegated to another copy of the human, who can ask subsubquestions, ad infinitum. This post points out that HCH has a free parameter: the base human policy. We could imagine e.g. taking a Mechanical Turk worker and using them as the base human policy, and we could argue that HCH would give good answers in this setting as long as the worker is well-motivated, since they are using “human-like” reasoning. However, there are other alternatives. For example, in theory we could formalize a “core” of reasoning. For concreteness, suppose we implement a lookup table for “simple” questions, and then use this lookup table as the base policy. We might expect this to be safe because of theorems that we proved about the lookup table, or by looking at the process by which the development team created the lookup table. In between these two extremes, we could imagine that the AI researchers train the human overseers in how to corrigibly answer questions, and then the human policy is used in HCH. This seems distinctly more likely to be safe than the first case.
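The "free parameter" framing can be sketched in code: the recursion is fixed, and the base policy is passed in as an argument. Everything below is a hypothetical illustration, not a definition from the post. The `Step` type, the `hch` function, the depth cap (real HCH is an idealized unbounded recursion), and the toy `careful_adder` policy are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

@dataclass
class Step:
    """What a base policy returns: either a direct answer, or
    subquestions plus a way to combine their answers."""
    answer: Optional[Any] = None
    subquestions: Optional[List[Any]] = None
    combine: Optional[Callable[[List[Any]], Any]] = None

def hch(question, policy, max_depth=20):
    """Answer `question` by recursively consulting copies of `policy`.
    `policy` is the free parameter; `max_depth` is a practical cap."""
    if max_depth < 0:
        raise RecursionError("delegation went deeper than max_depth")
    step = policy(question)
    if step.answer is not None:
        return step.answer
    sub_answers = [hch(q, policy, max_depth - 1) for q in step.subquestions]
    return step.combine(sub_answers)

def careful_adder(numbers):
    """Toy base policy: only adds up to two numbers directly,
    and delegates anything longer by splitting it in half."""
    if len(numbers) <= 2:
        return Step(answer=sum(numbers))
    mid = len(numbers) // 2
    return Step(subquestions=[numbers[:mid], numbers[mid:]], combine=sum)

if __name__ == "__main__":
    print(hch(list(range(1, 11)), careful_adder))
```

Swapping `careful_adder` for a lookup table, or for a policy imitating a trained overseer, changes the behavior of the whole tree without touching the recursion itself.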

Ro­hin’s opinion: I strongly agree with the gen­eral point that we can get sig­nifi­cant safety by im­prov­ing the hu­man policy (AN #43), es­pe­cially with HCH and iter­ated am­plifi­ca­tion, since they de­pend on hav­ing good hu­man over­seers, at least ini­tially.

Re­in­force­ment Learn­ing in the Iter­ated Am­plifi­ca­tion Frame­work (William Saun­ders): This post and its com­ments clar­ify how we can use re­in­force­ment learn­ing for the dis­til­la­tion step in iter­ated am­plifi­ca­tion. The dis­cus­sion is still hap­pen­ing so I don’t want to sum­ma­rize it yet.

Learn­ing hu­man intent

Learn­ing Prefer­ences by Look­ing at the World (Ro­hin Shah and Dmitrii Krashen­in­nikov): Sum­ma­rized in the high­lights!

Prevent­ing bad behavior

Test Cases for Im­pact Reg­u­lari­sa­tion Meth­ods (Daniel Filan): This post col­lects var­i­ous test cases that re­searchers have pro­posed for im­pact reg­u­lariza­tion meth­ods. A sum­mary of each one would be far too long for this newslet­ter, so you’ll have to read the post it­self.

Ro­hin’s opinion: Th­ese test cases and the as­so­ci­ated com­men­tary sug­gest to me that we haven’t yet set­tled on what prop­er­ties we’d like our im­pact reg­u­lariza­tion meth­ods to satisfy, since there are pairs of test cases that seem hard to solve si­mul­ta­neously, as well as test cases where the de­sired be­hav­ior is un­clear.


Neu­ral Net­works seem to fol­low a puz­zlingly sim­ple strat­egy to clas­sify images (Wie­land Bren­del and Matthias Bethge): This is a blog post ex­plain­ing the pa­per Ap­prox­i­mat­ing CNNs with bag-of-lo­cal-fea­tures mod­els works sur­pris­ingly well on ImageNet, which was sum­ma­rized in AN #33.


AI Align­ment Pod­cast: The Byzan­tine Gen­er­als’ Prob­lem, Poi­son­ing, and Distributed Ma­chine Learn­ing (Lu­cas Perry and El Mahdi El Mah­mdi) (sum­ma­rized by Richard): Byzan­tine re­silience is the abil­ity of a sys­tem to op­er­ate suc­cess­fully when some of its com­po­nents have been cor­rupted, even if it’s un­clear which ones they are. In the con­text of ma­chine learn­ing, this is rele­vant to poi­son­ing at­tacks in which some train­ing data is al­tered to af­fect the batch gra­di­ent (one ex­am­ple be­ing the ac­tivity of fake ac­counts on so­cial me­dia sites). El Mahdi ex­plains that when data is very high-di­men­sional, it is easy to push a neu­ral net­work into a bad lo­cal min­i­mum by al­ter­ing only a small frac­tion of the data. He ar­gues that his work on miti­gat­ing this is rele­vant to AI safety: even su­per­in­tel­li­gent AGI will be vuln­er­a­ble to data poi­son­ing due to time con­straints on com­pu­ta­tion, and the fact that data poi­son­ing is eas­ier than re­silient learn­ing.
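As a rough illustration of why the choice of aggregation rule matters for poisoning, the sketch below compares averaging worker gradients with taking a coordinate-wise median. The coordinate-wise median is a standard robust aggregation rule that I am swapping in for illustration; the summary does not say which methods the podcast discusses, and the gradient values are made up.

```python
from statistics import mean, median

def aggregate(gradients, reducer):
    """Combine per-worker gradient vectors one coordinate at a time."""
    return [reducer(coords) for coords in zip(*gradients)]

# Four honest workers roughly agree; one corrupted worker sends a huge gradient.
honest = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1], [1.0, -2.0]]
poisoned = honest + [[1000.0, 1000.0]]

if __name__ == "__main__":
    print("mean:  ", aggregate(poisoned, mean))
    print("median:", aggregate(poisoned, median))
```

A single corrupted worker drags the mean far from the honest values, while the median stays with the honest majority. This toy case is two-dimensional, so it says nothing about the high-dimensional difficulty mentioned in the summary.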

Trustworthy Deep Learning Course (Jacob Steinhardt, Dawn Song, Trevor Darrell) (summarized by Dan H): This course, currently underway, covers AI safety topics for current deep learning systems. The course includes slides and videos.

AI strat­egy and policy

How Sure are we about this AI Stuff? (Ben Garfinkel) (sum­ma­rized by Richard): Ben out­lines four broad ar­gu­ments for pri­ori­tis­ing work on su­per­in­tel­li­gent AGI: that AI will have a big in­fluence over the long-term fu­ture, and more speci­fi­cally that it might cause in­sta­bil­ity, lock-in or large-scale “ac­ci­dents”. He notes the draw­backs of each line of ar­gu­ment. In par­tic­u­lar, the “AI is a big deal” ar­gu­ment doesn’t show that we have use­ful lev­er­age over out­comes (com­pare a Vic­to­rian try­ing to im­prove the long-term effects of the in­dus­trial rev­olu­tion). He claims that the next two ar­gu­ments have sim­ply not been re­searched thor­oughly enough to draw any con­clu­sions. And while the ar­gu­ment from ac­ci­dents has been made by Bostrom and Yud­kowsky, there hasn’t been suffi­cient elab­o­ra­tion or crit­i­cism of it, es­pe­cially in light of the re­cent rise of deep learn­ing, which re­frames many ideas in AI.

Richard’s opinion: I find this talk to be em­i­nently rea­son­able through­out. It high­lights a con­cern­ing lack of pub­lic high-qual­ity en­gage­ment with the fun­da­men­tal ideas in AI safety over the last few years, rel­a­tive to the growth of the field as a whole (al­though note that in the past few months this has been chang­ing, with three ex­cel­lent se­quences re­leased on the Align­ment Fo­rum, plus Drexler’s tech­ni­cal re­port). This is some­thing which mo­ti­vates me to spend a fair amount of time writ­ing about and dis­cussing such ideas.

One nit­pick: I dis­like the use of “ac­ci­dents” as an um­brella term for AIs be­hav­ing in harm­ful ways un­in­tended by their cre­ators, since it’s mis­lead­ing to de­scribe de­liber­ately ad­ver­sar­ial be­havi­our as an “ac­ci­dent” (al­though note that this is not spe­cific to Ben’s talk, since the ter­minol­ogy has been in use at least since the Con­crete prob­lems pa­per).

Sum­mary of the 2018 Depart­ment of Defense Ar­tifi­cial In­tel­li­gence Strat­egy (DOD)

Other progress in AI

Re­in­force­ment learning

The Han­abi Challenge: A New Fron­tier for AI Re­search (Nolan Bard, Jakob Fo­er­ster et al) (sum­ma­rized by Richard): The au­thors pro­pose the co­op­er­a­tive, im­perfect-in­for­ma­tion card game Han­abi as a tar­get for AI re­search, due to the ne­ces­sity of rea­son­ing about the be­liefs and in­ten­tions of other play­ers in or­der to win. They iden­tify two challenges: firstly, dis­cov­er­ing a policy for a whole team that al­lows it to win (the self-play set­ting); and sec­ondly, dis­cov­er­ing an in­di­vi­d­ual policy that al­lows an agent to play with an ad-hoc team with­out pre­vi­ous co­or­di­na­tion. They note that suc­cess­ful self-play poli­cies are of­ten very brit­tle in the ad-hoc set­ting, which makes the lat­ter the key prob­lem. The au­thors provide an open-source frame­work, an eval­u­a­tion bench­mark and the re­sults of ex­ist­ing RL tech­niques.

Richard’s opinion: I en­dorse the goals of this pa­per, but my guess is that Han­abi is sim­ple enough that agents can solve it us­ing iso­lated heuris­tics rather than gen­eral rea­son­ing about other agents’ be­liefs.

Rohin’s opinion: I’m particularly excited to see more work on ad hoc teamwork, since it seems very similar to the setting we are in, where we would like to deploy AI systems among groups of humans and have things go well. See Following human norms (AN #42) for more details.

Read more: A co­op­er­a­tive bench­mark: An­nounc­ing the Han­abi Learn­ing Environment

A Com­par­a­tive Anal­y­sis of Ex­pected and Distri­bu­tional Re­in­force­ment Learn­ing (Clare Lyle et al) (sum­ma­rized by Richard): Distri­bu­tional RL sys­tems learn dis­tri­bu­tions over the value of ac­tions rather than just their ex­pected val­ues. In this pa­per, the au­thors in­ves­ti­gate the rea­sons why this tech­nique im­proves re­sults, by train­ing dis­tri­bu­tion learner agents and ex­pec­ta­tion learner agents on the same data. They provide ev­i­dence against a num­ber of hy­pothe­ses: that dis­tri­bu­tional RL re­duces var­i­ance; that dis­tri­bu­tional RL helps with policy iter­a­tion; and that dis­tri­bu­tional RL is more sta­ble with func­tion ap­prox­i­ma­tion. In fact, dis­tri­bu­tional meth­ods have similar perfor­mance to ex­pec­ta­tion meth­ods when us­ing tab­u­lar rep­re­sen­ta­tions or lin­ear func­tion ap­prox­i­ma­tors, but do bet­ter when us­ing non-lin­ear func­tion ap­prox­i­ma­tors such as neu­ral net­works (es­pe­cially in the ear­lier lay­ers of net­works).
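One way to get a feel for the tabular result is a coupled-samples toy, sketched below under heavy simplifying assumptions of my own: a single-state bandit with no bootstrapping, rewards that always land exactly on a fixed set of atoms, and a categorical distribution learner updated by mixing toward the observed atom. This is not the paper's experimental setup. In this setting the mean of the learned distribution tracks the expectation learner's estimate exactly, so the two agents would act identically.

```python
import random

ATOMS = [0.0, 1.0, 2.0]  # assumed fixed support for the return distribution

def run_coupled(rewards, alpha=0.1):
    """Train an expectation learner and a categorical distribution
    learner on the same reward samples."""
    q = 1.0                        # expectation learner's estimate
    probs = [1 / 3, 1 / 3, 1 / 3]  # distribution learner; its mean is also 1.0
    for r in rewards:
        # Expectation learner: move the estimate toward the sample.
        q += alpha * (r - q)
        # Distribution learner: mix toward a point mass on the sampled atom.
        target = [1.0 if atom == r else 0.0 for atom in ATOMS]
        probs = [(1 - alpha) * p + alpha * t for p, t in zip(probs, target)]
    dist_mean = sum(p * a for p, a in zip(probs, ATOMS))
    return q, dist_mean, probs

if __name__ == "__main__":
    rng = random.Random(0)
    rewards = [rng.choice(ATOMS) for _ in range(1000)]
    q, dist_mean, probs = run_coupled(rewards)
    print(q, dist_mean, probs)
```

The match follows from linearity: the mean of the mixed distribution is `(1 - alpha) * old_mean + alpha * r`, which is the same update the expectation learner applies.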

Richard’s opinion: I like this sort of re­search, and its find­ings are in­ter­est­ing (even if the au­thors don’t ar­rive at any clear ex­pla­na­tion for them). One con­cern: I may be miss­ing some­thing, but it seems like the cou­pled sam­ples method they use doesn’t al­low in­ves­ti­ga­tion into whether dis­tri­bu­tional meth­ods benefit from gen­er­at­ing bet­ter data (e.g. via more effec­tive ex­plo­ra­tion).

Re­cur­rent Ex­pe­rience Re­play in Distributed Re­in­force­ment Learn­ing (Steven Kap­tur­owski et al): See Im­port AI.

Vi­sual Hind­sight Ex­pe­rience Re­play (Hi­man­shu Sahni et al)

A Geo­met­ric Per­spec­tive on Op­ti­mal Rep­re­sen­ta­tions for Re­in­force­ment Learn­ing (Marc G. Bel­le­mare et al)

The Value Func­tion Poly­tope in Re­in­force­ment Learn­ing (Robert Dadashi et al)

Deep learning

A Con­ser­va­tive Hu­man Baseline Es­ti­mate for GLUE: Peo­ple Still (Mostly) Beat Machines (Nik­ita Nan­gia et al) (sum­ma­rized by Dan H): BERT tremen­dously im­proves perfor­mance on sev­eral NLP datasets, such that it has “taken over” NLP. GLUE rep­re­sents perfor­mance of NLP mod­els across a broad range of NLP datasets. Now GLUE has hu­man perfor­mance mea­sure­ments. Ac­cord­ing to the cur­rent GLUE leader­board, the gap be­tween hu­man perfor­mance and mod­els fine-tuned on GLUE datasets is a mere 4.7%. Hence many cur­rent NLP datasets are nearly “solved.”


Gover­nance of AI Fel­low­ship (Markus An­der­ljung): The Cen­ter for the Gover­nance of AI is look­ing for a few fel­lows to work for around 3 months on AI gov­er­nance re­search. They ex­pect that fel­lows will be at the level of PhD stu­dents or post­docs, though there are no strict re­quire­ments. The first round ap­pli­ca­tion dead­line is Feb 28, and the sec­ond round ap­pli­ca­tion dead­line is Mar 28.