Alignment Newsletter #49

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.


Exploring Neural Networks with Activation Atlases (Shan Carter et al): Previous work by this group of people includes The Building Blocks of Interpretability and Feature Visualization, both of which apparently came out before this newsletter started so I don’t have a summary to point to. Those were primarily about understanding what individual neurons in an image classifier were responding to, and the key idea was to “name” each neuron with the input that would maximally activate that neuron. This can give you a global view of what the network is doing.

However, such a global view makes it hard to understand the interaction between neurons. To understand these interactions, we can look at a specific input image and use techniques like attribution. Rather than attributing final classifications to the input, you can attribute classifications to neurons in the network; since individual neurons now have meanings (roughly: “fuzzy texture neuron”, “tennis ball neuron”, etc.), you can gain insight into how the network is making decisions for that specific input.

However, ideally we would like to see how the network uses interactions between neurons to make decisions in general, not on a single image. This motivates activation atlases, which analyze the activations of a network on a large dataset of inputs. In particular, for each of a million images, they randomly choose a non-border patch from the image, and compute the activation vector at a particular layer of the network at that patch. This gives a dataset of a million activation vectors. They use standard dimensionality reduction techniques to map each activation vector into an (x, y) point on the 2D plane. They divide the 2D plane into a reasonably sized grid (e.g. 50x50), and for each grid cell they compute the average of all the activation vectors in the cell, visualize that activation vector using feature visualization, and put the resulting image into the grid cell. This gives a 50x50 grid of the “concepts” that the particular neural network layer we are analyzing can reason about. They also use attribution to show, for each grid cell, which class that grid cell most supports.
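The grid-construction step can be sketched in a few lines. This is a minimal sketch, not the paper's code: a random linear projection stands in for the UMAP/t-SNE dimensionality reduction, and the returned mean vectors would be rendered with feature visualization rather than inspected directly.

```python
import numpy as np

def activation_atlas_grid(activations, grid_size=50, seed=0):
    """Bin 2D-projected activation vectors into a grid and average each cell.

    `activations` is an (n, d) array of activation vectors, one per image
    patch. A random linear projection is used here as a placeholder for the
    UMAP/t-SNE step; each cell's mean vector would then be rendered with
    feature visualization.
    """
    rng = np.random.default_rng(seed)
    n, d = activations.shape
    # Project to 2D (placeholder for UMAP), then rescale to [0, 1).
    xy = activations @ rng.standard_normal((d, 2))
    xy = (xy - xy.min(axis=0)) / (np.ptp(xy, axis=0) + 1e-9)
    cells = np.minimum((xy * grid_size).astype(int), grid_size - 1)
    atlas = np.zeros((grid_size, grid_size, d))
    counts = np.zeros((grid_size, grid_size))
    for (cx, cy), act in zip(cells, activations):
        atlas[cx, cy] += act  # accumulate activation vectors per cell
        counts[cx, cy] += 1
    nonempty = counts > 0
    atlas[nonempty] /= counts[nonempty][:, None]  # cell-wise mean
    return atlas, counts
```

The averaging is the key move: each cell summarizes all nearby activation vectors, so feature-visualizing the mean gives one "concept" image per region of activation space.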

The paper then goes into a lot of detail about what we can infer from the activation atlas. For example, we can see that paths in activation vector space can correspond to human-interpretable concepts like the number of objects in an image, or moving from water to beaches to rocky cliffs. If we look at activation atlases for different layers, we can see that the later layers seem to get much more specific and complex, and formed of combinations of previous features (e.g. combining sand and water features to get a single sandbar feature).

By looking at images for specific classes, we can use attribution to see which parts of an activation atlas are most relevant for the class. By comparing across classes, we can see how the network makes decisions. For example, for fireboats vs. streetcars, the network looks for windows for both, crane-like structures for both (though less than windows), and water for fireboats vs. buildings for streetcars. This sort of analysis can also help us find mistakes in reasoning—e.g. looking at the difference between grey whales and great white sharks, we can see that the network looks for the teeth and mouth of a great white shark, including an activation that looks suspiciously like a baseball. In fact, if you take a grey whale and put a patch of a baseball in the top left corner, this becomes an adversarial example that fools the network into thinking the grey whale is a great white shark. They run a bunch of experiments with these human-found adversarial examples and find they are quite effective.
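The hand-made attack itself is mechanically very simple, which is part of what makes it striking. A minimal sketch (the sizes and positions are illustrative assumptions; the "attack" is just running the classifier on the edited image):

```python
import numpy as np

def paste_patch(image, patch, row=0, col=0):
    """Paste a small patch (e.g. a baseball crop) into an image, top-left
    by default, mimicking the human-found adversarial examples.

    Both arrays are (H, W, C) floats in [0, 1]. The original image is not
    modified; a copy with the patch pasted in is returned.
    """
    out = image.copy()
    ph, pw = patch.shape[:2]
    out[row:row + ph, col:col + pw] = patch
    return out
```

No gradients or optimization are involved; the interpretability analysis alone suggested which patch to paste where.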

Rohin’s opinion: While the authors present this as a method for understanding how neurons interact, it seems to me that the key insight is about looking at and explaining the behavior of the neural network on in-distribution data points. Most possible inputs are off-distribution, and there is not much to be gained by understanding what the network does on those points. Techniques that aim to gain a global understanding of the network are going to be “explaining” the behavior of the network on such points as well, and so will be presenting data that we won’t be able to interpret. By looking specifically at activations corresponding to in-distribution images, we can ensure that the data we’re visualizing is in-distribution and is expected to make sense to us.

I’m pretty excited that interpretability techniques have gotten good enough that they allow us to construct adversarial examples “by hand”—that seems like a clear demonstration that we are learning something real about the network. It feels like the next step would be to use interpretability techniques to enable us to actually fix the network—though admittedly this would require us to also develop methods that allow humans to “tweak” networks, which doesn’t really fit within interpretability research as normally defined.

Read more: OpenAI blog post and Google AI blog post

Feature Denoising for Improving Adversarial Robustness (Cihang Xie et al) (summarized by Dan H): This paper claims to obtain nontrivial adversarial robustness on ImageNet. Assuming an adversary can add perturbations of size 16/255 (l_infinity), previous adversarially trained classifiers could not obtain above 1% adversarial accuracy. Some groups have tried to break the model proposed in this paper, but so far its robustness appears close to what is claimed, around 40% adversarial accuracy. Vanilla adversarial training is how they obtain said adversarial robustness. There has only been one previous public attempt at applying (multistep) adversarial training to ImageNet, as those at universities simply do not have the GPUs necessary to perform adversarial training on 224x224 images. Unlike the previous attempt, this paper ostensibly uses better hyperparameters, possibly accounting for the discrepancy. If true, this result reminds us that hyperparameter tuning can be critical even in vision, and that improving adversarial robustness on large-scale images may not be possible outside industry for many years.
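The attack underlying both the evaluation and the adversarial training here is multistep l_infinity PGD. A minimal sketch of one PGD attack on a toy setup (the linear classifier with logistic loss is an assumption for the example, not the paper's model):

```python
import numpy as np

def pgd_linfty(x, y, w, eps=16/255, step=2/255, iters=10):
    """Sketch of an l_infinity PGD attack with budget eps = 16/255.

    Toy setting: a linear classifier `w` with logistic loss on one example
    `x` (pixels in [0, 1]) with label y in {-1, +1}. Each step ascends the
    sign of the loss gradient, then projects back into the eps-ball around
    x and into the valid pixel range.
    """
    x_adv = x.copy()
    for _ in range(iters):
        margin = y * (w @ x_adv)
        grad = -y * w / (1.0 + np.exp(margin))  # d(logistic loss)/d(x_adv)
        x_adv = x_adv + step * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project to eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # valid pixel range
    return x_adv
```

Adversarial training then simply trains on `pgd_linfty(...)` outputs instead of clean inputs, which is the "vanilla" recipe the summary refers to; at 224x224 with a multistep inner loop this multiplies training cost by roughly the number of PGD steps, hence the GPU requirements.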

Technical AI alignment

Learning human intent

Using Causal Analysis to Learn Specifications from Task Demonstrations (Daniel Angelov et al)

Reward learning theory

A theory of human values (Stuart Armstrong): This post presents an outline of how to construct a theory of human values. First, we need to infer preferences and meta-preferences from humans who are in “reasonable” situations. Then we need to synthesize these into a utility function, by resolving contradictions between preferences, applying meta-preferences to preferences, and having a way of changing the procedures used to do the previous two things. We then need to argue that this leads to adequate outcomes—he gives some simple arguments for this, which rely on particular facts about humans (such as the fact that they are scope insensitive).

Preventing bad behavior

Designing agent incentives to avoid side effects (Victoria Krakovna et al): This blog post provides details about the recent update to the relative reachability paper (AN #10), which is now more a paper about the design choices available with impact measures. There are three main axes that they identify:

First, what baseline is impact measured relative to? A natural choice is to compare against the starting state, but this will penalize the agent for environment effects, such as apples growing on trees. We can instead compare against an inaction baseline, i.e. measuring impact relative to what would have happened if the agent had done nothing. Unfortunately, this leads to offsetting behavior: the agent first makes a change to get reward, and then undoes the change in order to not be penalized for impact. This motivates the stepwise inaction baseline, which compares each action against what would have happened if the agent did nothing from that step onwards.

Second, we need a measure by which to compare states. The unreachability measure measures how hard it is to reach the baseline from the current state. However, this “maxes out” as soon as the baseline is unreachable, and so there is no incentive to avoid further irreversible actions. This motivates relative reachability, which computes the set of states reachable from the baseline, and measures what proportion of those states are reachable from the state created by the agent. Attainable utility (AN #25) generalizes this to talk about the utility that could be achieved from the baseline for a wide range of utility functions. (This is equivalent to relative reachability when the utility functions are of the form “1 if state s is ever encountered, else 0”.)

Finally, we need to figure out how to penalize changes in our chosen measure. Penalizing decreases in the measure allows us to penalize actions that make it harder to do things (what the AUP post calls “opportunity cost”), while penalizing increases in the measure allows us to penalize convergent instrumental subgoals (which almost by definition increase the ability to satisfy many different goals or reach many different states).
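The second and third axes can be made concrete in a few lines. A minimal sketch (the tiny hand-written reachability table is an illustrative assumption; real implementations derive it, possibly discounted, from environment dynamics):

```python
def relative_reachability(reachable, baseline, current):
    """Fraction of states reachable from the baseline that remain
    reachable from the current state, as described above.

    `reachable` maps each state to the set of states reachable from it.
    """
    base_set = reachable[baseline]
    if not base_set:
        return 1.0
    return len(base_set & reachable[current]) / len(base_set)

def side_effect_penalty(measure_baseline, measure_current,
                        penalize_increases=True):
    """Penalty on deviation of the chosen measure from its baseline value.

    Penalizing decreases captures opportunity cost; also penalizing
    increases discourages convergent instrumental subgoals, which tend to
    raise the agent's ability to reach many states.
    """
    diff = measure_current - measure_baseline
    if penalize_increases:
        return abs(diff)
    return max(0.0, -diff)
```

In the toy table below (see the test), an irreversible step that cuts off the starting state shrinks the reachable set, lowering relative reachability and producing a nonzero penalty, while a power-gaining action that enlarges the measure is penalized only when increases are penalized too.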

Rohin’s opinion: Since the AUP post was published about half a year ago, I’ve been watching this unification of AUP and relative reachability slowly take form, since they were phrased very differently initially. I’m glad to see this finally explained clearly and concisely, with experiments showing the effect of each choice. I do want to put special emphasis on the insight of AUP that the pursuit of convergent instrumental subgoals leads to large increases in “ability to do things”, and thus that penalizing increases can help avoid such subgoals. This point doesn’t typically make it into the academic writings on the subject but seems quite important.

On the topic of impact measures, I’ll repeat what I’ve said before: I think that it’s hard to satisfy the conjunction of three desiderata—objectivity (no dependence on human values), safety (preventing any catastrophic outcomes) and usefulness (the AI system is still able to do useful things). Impact measures are very clearly aiming for the first two criteria, but usually don’t have much to say about the third one. My expectation is that there is a strong tradeoff between the first two criteria and the third one, and impact measures have not dealt with this fact yet, but will have to at some point.

Conservative Agency via Attainable Utility Preservation (Alexander Matt Turner et al): This paper presents in a more academic format a lot of the content that Alex has published about attainable utility preservation; see Towards a New Impact Measure (AN #25) and Penalizing Impact via Attainable Utility Preservation (AN #39).


Interpretability

Exploring Neural Networks with Activation Atlases (Shan Carter et al): Summarized in the highlights!

Adversarial examples

Feature Denoising for Improving Adversarial Robustness (Cihang Xie et al): Summarized in the highlights!


Forecasting

Signup form for AI Metaculus (Jacob Lagerros and Ben Goldhaber): Recently, the forecasting platform Metaculus launched a new instance dedicated specifically to AI, in order to get good answers to empirical questions (such as AGI timelines) that can help avoid situations like info-cascades. While most questions don’t have that many predictions, the current set of beta users were invited based on forecasting track record and AI domain expertise, so the signal of the average forecast should be high.

Some interesting predictions include:

- By end of 2019, will there be an agent at least as good as AlphaStar using non-controversial, human-like APM restrictions? [mean: 58%, median: 66%, n = 26]

- When will there be a superhuman StarCraft II agent with no domain-specific hardcoded knowledge, trained using <=$10,000 of publicly available compute? [50%: 2021 to 2037, with median 2026, n = 35]

This forecast is supported by a Guesstimate model, which estimates current and future sample efficiency of StarCraft II algorithms, based on current performance, algorithmic progress, and the generalization of Moore’s law. For algorithmic progress, they look at the improvement in sample efficiency on Atari, and find a doubling time of roughly a year, via DQN --> DDQN --> Dueling DDQN --> Prioritized DDQN --> PPO --> Rainbow --> IMPALA.
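The arithmetic behind a doubling-time estimate like this is simple to sketch (the numbers in the example are illustrative assumptions, not the Guesstimate model's inputs):

```python
import math

def doubling_time(total_factor, years):
    """Years per doubling implied by an overall sample-efficiency gain
    of `total_factor` accumulated over `years` years."""
    return years / math.log2(total_factor)

def efficiency_after(years, doubling_years=1.0):
    """Projected sample-efficiency multiplier after `years` more progress
    at the given doubling time."""
    return 2 ** (years / doubling_years)
```

For instance, if the six algorithmic steps in the DQN-to-IMPALA chain each roughly halved sample complexity over about six years, that is a 64x gain and a one-year doubling time, which the model can then project forward to estimate when a $10,000 training run becomes feasible.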

Overall, there are 50+ questions, including on malicious use of AI, publishing norms, conference attendance, MIRI’s research progress, the max compute doubling trend, OpenAI LP, nationalisation of AI labs, whether financial markets expect AGI, and more. You can sign up to join here.

AI conference attendance (Katja Grace): This post presents data on attendance numbers at AI conferences. The main result: “total large conference participation has grown by a factor 3.76 between 2011 and 2019, which is equivalent to a factor of 1.21 per year during that period”. Looking at the graph, it seems to me that the exponential growth started in 2013, which would mean a slightly higher factor of around 1.3 per year. This would also make sense given that the current boom is often attributed to the publication of AlexNet in 2012.
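The per-year figure is just an n-th root of the total growth factor, so the two readings above are easy to check (how many year-on-year steps the post counts in 2011-2019 is an assumption here; seven steps reproduces the quoted figure):

```python
def per_year_factor(total_factor, years):
    """Annual growth factor implied by `total_factor` total growth
    over `years` year-on-year steps."""
    return total_factor ** (1 / years)
```

With seven steps, per_year_factor(3.76, 7) is about 1.21, matching the quote; if the same total growth is compressed into the six steps from 2013, the implied annual factor rises to about 1.25, in the ballpark of the ~1.3 eyeball estimate above.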

Field building

Alignment Research Field Guide (Abram Demski): This post gives advice on how to get started on technical research, in particular by starting a local MIRIx research group.

Rohin’s opinion: I strongly recommend this post to anyone looking to get into research—it’s a great post; I’m not summarizing it because I want this newsletter to be primarily about technical research. Even if you are not planning to do the type of research that MIRI does, I think this post presents a very different perspective on how to do research compared to the mainstream view in academia. Note though that this is not the advice I’d give to someone trying to publish papers or break into academia. Also, while I’m talking about recommendations on how to do research, let me also recommend Research as a Stochastic Decision Process.

Miscellaneous (Alignment)

Partial preferences needed; partial preferences sufficient (Stuart Armstrong): I’m not sure I fully understand this post, but my understanding is that it is saying that alignment proposals must rely on some information about human preferences. Proposals like impact measures and corrigibility try to formalize a property that will lead to good outcomes; but any such formalization will be denoting some policies as safe and some as dangerous, and there will always exist a utility function according to which the “safe” policies are catastrophic. Thus, you need to also define a utility function (or a class of them?) that safety is computed with respect to; and designing this is particularly difficult.

Rohin’s opinion: This seems very similar to the problem I have with impact measures, but I wouldn’t apply that argument to corrigibility. I think the difference might be that I’m thinking of “natural” things that agents might want, whereas Stuart is considering the entire space of possible utility functions. I’m not sure what drives this difference.

Understanding Agent Incentives with Causal Influence Diagrams (Tom Everitt et al): This post and associated paper model an agent’s decision process using a causal influence diagram—think of a Bayes net, and then imagine that you add nodes corresponding to actions and utilities. A major benefit of Bayes nets is that the criterion of d-separation can be used to determine whether two nodes are conditionally independent. Once we add actions and utilities, we can also analyze whether observing or intervening on nodes would lead the agent to achieve higher expected utility. The authors derive criteria resembling d-separation for identifying each of these cases, which they call observation incentives (for nodes whose value the agent would like to know) and intervention incentives (for nodes whose value the agent would like to change). They use observation incentives to show how to analyze whether a particular decision is fair or not (that is, whether it depended on a sensitive feature that should not be used, like gender). Intervention incentives are used to establish the security of counterfactual oracles more simply and rigorously.
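D-separation, the Bayes-net criterion that the incentive criteria resemble, can be checked mechanically. A minimal sketch using plain dicts rather than any particular graph library, via the standard reduction: restrict to ancestors, moralize, drop edge directions, delete the conditioning set, and test connectivity.

```python
from collections import defaultdict

def d_separated(parents, xs, ys, zs):
    """Check whether node sets `xs` and `ys` are d-separated given `zs`
    in a DAG described by `parents` (node -> list of parent nodes)."""
    # Ancestral closure of xs | ys | zs (including the nodes themselves).
    relevant = set()
    stack = list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node in relevant:
            continue
        relevant.add(node)
        stack.extend(parents.get(node, []))
    # Moralized, undirected ancestral graph.
    adj = defaultdict(set)
    for node in relevant:
        ps = [p for p in parents.get(node, []) if p in relevant]
        for p in ps:
            adj[node].add(p)
            adj[p].add(node)
        for i, a in enumerate(ps):  # "marry" parents of a common child
            for b in ps[i + 1:]:
                adj[a].add(b)
                adj[b].add(a)
    # Delete zs and test whether xs can still reach ys.
    seen, stack = set(), [n for n in xs if n not in zs]
    while stack:
        node = stack.pop()
        if node in seen or node in zs:
            continue
        seen.add(node)
        stack.extend(adj[node])
    return not (seen & ys)
```

The paper's contribution is criteria of this flavor applied to diagrams with action and utility nodes; this sketch only covers the underlying conditional-independence check.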

Rohin’s opinion: These criteria are theoretically quite nice, but I’m not sure how they relate to the broader picture. Is the hope that we will be able to elicit the causal influence diagram an AI system is using, or something like it? Or perhaps that we will be able to create a causal influence diagram of the environment, and these criteria can tell us which nodes we should be particularly interested in? Maybe the goal was simply to understand agent incentives better, with the expectation that more knowledge would help in some as-yet-unknown way? None of these seem very compelling to me, but the authors might have something in mind I haven’t thought of.

Other progress in AI


Exploration

World Discovery Models (Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo Avila Pires et al)

Reinforcement learning

Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future (Nan Rosemary Ke et al)

Deep learning

Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions (Matthew MacKay, Paul Vicol et al)

Hierarchical RL

Model Primitive Hierarchical Lifelong Reinforcement Learning (Bohan Wu et al)

Miscellaneous (AI)

The Bitter Lesson (Rich Sutton): This blog post is controversial. This is a combination summary and opinion, and so is more biased than my summaries usually are.

Much research in AI has been about embedding human knowledge in AI systems, in order to use the current limited amount of compute to achieve some outcomes. That is, we try to get our AI systems to think the way we think we think. However, this usually results in systems that work currently, but then cannot leverage the increasing computation that will become available. The bitter lesson is that methods like search and learning that can scale to more computation eventually win out, as more computation becomes available. There are many examples that will likely be familiar to readers of this newsletter, such as chess (large-scale tree search), Go (large-scale self-play), image classification (CNNs), and speech recognition (Hidden Markov Models in the 70s, and now deep learning).

Shimon Whiteson’s take is that in reality lots of human knowledge has been important in getting AI to do things, such as the invariances built into convolutional nets, or the MCTS and self-play algorithm underlying AlphaZero. I don’t see this as opposed to Rich Sutton’s point—it seems to me that the takeaway is that we should aim to build algorithms that will be able to leverage large amounts of compute, but we can be clever about embedding important knowledge in such algorithms. I think this criterion would have predicted ex ante (i.e. before seeing the results) that much past and current research in AI was misguided, without also predicting that any of the major advances (like CNNs) were misguided.

It’s worth noting that this is coming from a perspective of aiming for the most general possible capabilities for AI systems. If your goal is instead to build something that works reliably now, then it really is a good idea to embed human domain knowledge, as it does lead to a performance improvement—you should just expect that in time the system will be replaced by a better-performing system with less embedded human knowledge.

One disagreement I have is that this post doesn’t acknowledge the importance of data. The AI advances we see now are ones where the data has been around for a long time (or you use simulation to get the data), and someone finally put in enough engineering effort + compute to get the data out and put it in a big enough model. That is, currently compute is increasing much faster (AN #7) than data, so the breakthroughs you see are in domains where the bottleneck was compute and not data; that doesn’t mean data bottlenecks don’t exist.


News

AI Safety workshop at IJCAI 2019 (Huáscar Espinoza et al): There will be a workshop on AI safety at IJCAI 2019 in Macao, China; the paper submission deadline is April 12. In addition to the standard submissions (technical papers, proposals for technical talks, and position papers), they are seeking papers for their “AI safety landscape” initiative, which aims to build a single document identifying the core knowledge and needs of the AI safety community.