Alignment Newsletter #44

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.


How does Gradient Descent Interact with Goodhart? (Scott Garrabrant): Scott often thinks about optimization using a simple proxy of "sample N points and choose the one with the highest value", where larger N corresponds to more powerful optimization. However, this seems to be a poor model for what gradient descent actually does, and it seems valuable to understand the difference (or to find out that there isn't any significant difference). A particularly interesting subquestion is whether Goodhart's Law behaves differently for gradient descent vs. random search.
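As a concrete version of this "sample N points" model (my own toy illustration, not from the post): suppose the proxy value is the true value plus independent noise. Selecting the best of N points by proxy value then over-selects for the noise, and the gap between proxy and true value grows with optimization power:

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n(n, trials=2000):
    """Scott's toy optimizer: sample n points, keep the one with the
    highest proxy value. Returns the average (true, proxy) values of the
    selected point, where proxy = true value + independent noise."""
    true = rng.normal(size=(trials, n))
    proxy = true + rng.normal(size=(trials, n))
    best = np.argmax(proxy, axis=1)
    rows = np.arange(trials)
    return true[rows, best].mean(), proxy[rows, best].mean()

# More optimization power (larger n) raises the proxy value faster than
# the true value -- the widening gap is the Goodhart effect.
for n in [1, 10, 1000]:
    t, p = best_of_n(n)
    print(f"n={n:5d}  true={t:.2f}  proxy={p:.2f}  gap={p - t:.2f}")
```

Whether gradient descent behaves like "best of astronomically large N" in this model is exactly the open question.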

Rohin's opinion: I don't think that the two methods are very different, and I expect that if you could control for "optimization power", the two methods would be about equally susceptible to Goodhart's Law. (In any given experiment, one will be better than the other, for reasons that depend on the experiment, but averaged across experiments I don't expect to see a clear winner.) However, I do think that gradient descent is very powerful at optimization, and it's hard to imagine the astronomically large random search that would compare with it, and so in any practical application gradient descent will lead to more Goodharting (and more overfitting) than random search. (It will also perform better, since it won't underfit, as random search would.)

One of the answers to this question describes some experimental evidence: the authors find that they can get different results with a relatively minor change to the experimental procedure, which I think is weak evidence for this position.

Transformer-XL: Unleashing the Potential of Attention Models (Zihang Dai, Zhilin Yang et al): Transformer architectures have become all the rage recently, showing better performance on many tasks compared to CNNs and RNNs. This post introduces Transformer-XL, an improvement on the Transformer architecture for very long sequences.

The key idea with the original Transformer architecture is to use self-attention layers to analyze sequences instead of something recurrent like an RNN, which has problems with vanishing and exploding gradients. An attention layer takes as input a query q and key-value pairs (K, V). The query q is "compared" against every key k, and that is used to decide whether to return the corresponding value v. In their particular implementation, for each key k, you take the dot product of q and k to get a "weight", which is then used to return the weighted average of all of the values. So, you can think of the attention layer as taking in a query q, and returning the "average" value corresponding to keys that are "similar" to q (since dot product is a measure of how aligned two vectors are).

Typically, in an attention layer, some subset of Q, K and V will be learned. With self-attention, Q, K and V are all sourced from the same place—the result of the previous layer (or the input if this is the first layer). Of course, it's not exactly the output from the previous layer—if that were the case, there would be no parameters to learn. They instead learn three linear projections (i.e. matrices) that map from the output of the previous layer to Q, K and V respectively, and then feed the generated Q, K and V into a self-attention layer to compute the final output. And actually, instead of having a single set of projections, they have multiple sets that each contain three learned linear projections, which are all then used for attention, and then combined together for the next layer by another learned matrix. This is called multi-head attention.
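The mechanics above can be sketched in a few lines (a minimal numpy sketch for one attention head, not the paper's exact implementation; dimensions are illustrative, and the usual scaling factor and masking are omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Compare each query against every key via dot products, turn the
    scores into weights, and return the weighted average of the values."""
    weights = softmax(Q @ K.T)   # (n_queries, n_keys), rows sum to 1
    return weights @ V           # (n_queries, d_value)

def self_attention(X, W_q, W_k, W_v):
    """Self-attention: Q, K and V are all learned linear projections of
    the same input X (the previous layer's output)."""
    return attention(X @ W_q, X @ W_k, X @ W_v)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 tokens, embedding dim 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                             # (5, 8)
```

Multi-head attention simply runs several copies of `self_attention` with separate projection matrices and combines the outputs with one more learned matrix.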

Of course, with attention, you are treating your data as a set of key-value pairs, which means that the order of the key-value pairs does not matter. However, the order of words in a sentence is obviously important. To allow the model to make use of position information, they augment each word embedding with position information. You could do this just by literally appending a single number to each word embedding representing its absolute position, but then it would be hard for the neural net to ask about a word that was "3 words prior". To make this easier for the net to learn, they create a vector of numbers representing the absolute position, based on sinusoids, such that "go back 3 words" can be computed by a linear function, which should be easy to learn, and add (not concatenate!) it elementwise to the word embedding.
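A sketch of the sinusoidal scheme, including a check of the "go back k words is a linear function" property (my own illustration of the original Transformer's encoding, not code from the post):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal absolute-position vectors (original Transformer scheme):
    even dimensions get sin(pos / 10000^(2i/d_model)) and odd dimensions
    get the matching cos. The result is *added* elementwise to the word
    embeddings, not concatenated."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / 10000 ** (i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)

# The property that makes this learnable: the encoding at pos + k is a
# fixed linear function of the encoding at pos (a rotation within each
# sin/cos pair), so "3 words prior" is a linear map the net can learn.
k, freq = 3, 1.0                # the first sin/cos pair has frequency 1
rot = np.array([[np.cos(k * freq), np.sin(k * freq)],
                [-np.sin(k * freq), np.cos(k * freq)]])
assert np.allclose(rot @ pe[7, :2], pe[7 + k, :2])
```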

This model works great when you are working with a single sentence, where you can attend over the entire sentence at once, but doesn't work as well when you are working with e.g. entire documents. So far, people have simply broken up documents into segments of a particular size N and trained Transformer models over these segments. Then, at test time, for each word, they use the past N − 1 words as context and run the model over all N words to get the output. This cannot model any dependencies whose range is larger than N.

The Transformer-XL model fixes this issue by taking the segments that vanilla Transformers use and adding recurrence. Now, in addition to the normal output predictions we get from segments, we also get as output a new hidden state, which is then passed in to the next segment's Transformer layer. This allows for arbitrarily long-range dependencies. However, this screws up our position information—each word in each segment is augmented with absolute position information, but this doesn't make sense across segments, since there will now be multiple words at (say) position 2—one for each segment. At this point, we actually want relative positions instead of absolute ones. They show how to do this—it's quite cool, but I don't know how to explain it without going into the math, and this has gotten long already. Suffice it to say that they look at the interaction between arbitrary words x_i and x_j, see the terms that arise in the computation when you add absolute position embeddings to each of them, and then change the terms so that they only depend on the difference i − j, which is a relative position.
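For readers who do want the math, the decomposition looks roughly like this (sketched from the Transformer-XL paper; notation approximate). With word embeddings E and absolute position embeddings U, the attention score between positions i and j expands into four terms:

```latex
(E_{x_i} + U_i)^\top W_q^\top W_k (E_{x_j} + U_j)
  = E_{x_i}^\top W_q^\top W_k E_{x_j}
  + E_{x_i}^\top W_q^\top W_k U_j
  + U_i^\top W_q^\top W_k E_{x_j}
  + U_i^\top W_q^\top W_k U_j
```

Transformer-XL replaces each key-side absolute embedding $U_j$ with a sinusoidal relative embedding $R_{i-j}$, and replaces the query-side position terms $U_i^\top W_q^\top$ with learned vectors $u$ and $v$ (the query's own absolute position shouldn't matter):

```latex
A^{\mathrm{rel}}_{i,j}
  = E_{x_i}^\top W_q^\top W_{k,E}\, E_{x_j}
  + E_{x_i}^\top W_q^\top W_{k,R}\, R_{i-j}
  + u^\top W_{k,E}\, E_{x_j}
  + v^\top W_{k,R}\, R_{i-j}
```

Every term now depends only on the content embeddings and the relative offset i − j, so the same parameters work across segments.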

This new model is state of the art on several tasks, though I don't know what the standard benchmarks are here, so I don't know how impressed I should be.

Rohin's opinion: It's quite interesting that even though the point of the Transformer was to get away from recurrent structures, adding them back in leads to significant improvements. Of course, the recurrent structure is now at the higher level of segments, rather than at the word or character level. This reminds me a lot of hierarchy—it seems like we're using the Transformer as a basic building block that works on the ~sentence level so that our RNN-like structure can deal with a higher level of abstraction (which of course also helps with vanishing/exploding gradients).

There's an interesting pattern where hierarchy and structure seem to be a good inductive bias that lets you get good performance with limited compute and data, but as those limits subside, you're better off doing something with less bias. This would predict that as we get more data and compute, we would want larger Transformer models (i.e. longer segments) and less recurrence. It would be interesting to see if that actually holds.

Technical AI alignment

Iterated amplification sequence

Reliability amplification (Paul Christiano): One hope for building an aligned AI system is to alternate capability amplification and reward engineering (both AN #42) with semi-supervised learning in order to create an agent that becomes more and more capable with more iterations, while remaining aligned. However, our agents will likely always have a small probability of failing, and when we use capability amplification, that failure probability is also amplified (since each of the subagents could fail). If we do enough capability amplification, the failure probability could approach 1. This motivates the problem of reliability amplification, which aims to take an agent that fails with small probability ε and produce a new agent that fails with significantly lower probability. As with capability amplification, the new agent can take much more time and compute than the original. The hope would be that the combination of capability and reliability amplification would allow us to take an initial agent and make it both more capable and more reliable (even though capability amplification reduces reliability, and reliability amplification will probably reduce capability).

One simple thing to do would be to sample three agents and have them collaboratively settle on an action. If all three agents are aligned, we're fine. The chance of two or three of the agents being malign is small enough that we can ignore it. The interesting case is if one agent is malign and two are benign. If there were only one correct answer, we could simply take a majority vote and get the right answer, and we would have amplified reliability to order ε^2 (which happens when you get two or more malign agents). However, different benign agents could give different answers, since there is no one "correct" benign answer. If we were to choose randomly, as you might expect, we get no benefit: we'd have a ~3ε chance of having exactly one malign agent, and a 1/3 chance of failing in that case, keeping the failure probability at ε. The hope for reducing the failure probability is that the two benign agents can "recognize" each other as benign, so that the majority vote has a >2/3 chance of choosing one of the actions that the benign agents suggested. This does depend on all of the agents having similar capabilities—if the malign agent is a superintelligence while the two benign agents are human-level, then it seems likely that the malign superintelligence would be able to convince the benign agents to go along with its plan.
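This arithmetic can be checked with a quick Monte Carlo sketch (my own illustration, not from the post; `recognize` is a made-up knob for how often the vote lands on a benign action when exactly one agent is malign):

```python
import numpy as np

rng = np.random.default_rng(0)

def vote(eps, recognize, trials=100_000):
    """Three agents, each independently malign with probability eps.
    With exactly one malign agent, the vote picks a benign action with
    probability `recognize`: 2/3 corresponds to choosing randomly among
    the three proposals, higher values to the benign agents recognizing
    each other."""
    malign = rng.random((trials, 3)) < eps
    n_malign = malign.sum(axis=1)
    fail = np.where(
        n_malign >= 2, True,                     # malign majority: fail
        np.where(n_malign == 1,
                 rng.random(trials) > recognize, # one malign: vote decides
                 False))                         # all benign: fine
    return fail.mean()

eps = 0.01
print(vote(eps, recognize=2/3))  # ~eps: random choice gives no benefit
print(vote(eps, recognize=1.0))  # ~3*eps^2: benign agents recognize each other
```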

Rohin's opinion: It seems like this requires the assumption that our agents have a small probability of failure on any given input. I think this makes sense if we are thinking of reliability of corrigibility (AN #35). That said, I'm pretty confused about what problem this technique is trying to protect against, which I wrote about in this comment.

Value learning sequence

Conclusion to the sequence on value learning (Rohin Shah): This post summarizes the value learning sequence, putting emphasis on particular parts. I recommend reading it in full—the sequence did have an overarching story, which was likely hard to keep track of over the three months that it was being published.

Technical agendas and prioritization

Drexler on AI Risk (Peter McCluskey): This is another analysis of Comprehensive AI Services. You can read my summary of CAIS (AN #40) to get my views.

Reward learning theory

One-step hypothetical preferences and A small example of one-step hypotheticals (Stuart Armstrong) (summarized by Richard): We don't hold most of our preferences in mind at any given time—rather, they need to be elicited from us by prompting us to think about them. However, a detailed prompt could be used to manipulate the resulting judgement. In this post, Stuart discusses hypothetical interventions which are short enough to avoid this problem, while still causing a human to pass judgement on some aspect of their existing model of the world—for example, being asked a brief question, or seeing something on a TV show. He defines a one-step hypothetical, by contrast, as a prompt which causes the human to reflect on a new issue that they hadn't considered before. While this data will be fairly noisy, he claims that there will still be useful information to be gained from it.

Richard's opinion: I'm not quite sure what overall point Stuart is trying to make. However, if we're concerned that an agent might manipulate humans, I don't see why we should trust it to aggregate the data from many one-step hypotheticals, since "manipulation" could then occur using the many degrees of freedom involved in choosing the questions and interpreting the answers.

Preventing bad behavior

Robust temporal difference learning for critical domains (Richard Klima et al)


How much can value learning be disentangled? (Stuart Armstrong) (summarized by Richard): Stuart argues that there is no clear line between manipulation and explanation, since even good explanations involve simplification, omissions and cherry-picking what to emphasise. He claims that the only difference is that explanations give us a better understanding of the situation—something which is very subtle to define or measure. Nevertheless, we can still limit the effects of manipulation by banning extremely manipulative practices, and by giving AIs values that are similar to our own, so that they don't need to manipulate us very much.

Richard's opinion: I think the main point that explanation and manipulation can often look very similar is an important one. However, I'm not convinced that there aren't any ways of specifying the difference between them. Other factors which seem relevant include what mental steps the explainer/manipulator is going through, and how they would change if the statement weren't true or if the explainee were significantly smarter.

Adversarial examples

Theoretically Principled Trade-off between Robustness and Accuracy (Hongyang Zhang et al) (summarized by Dan H): This paper won the NeurIPS 2018 Adversarial Vision Challenge. For robustness on CIFAR-10 against l_infinity perturbations (epsilon = 8/255), it improves over the Madry et al. adversarial training baseline from 45.8% to 56.61%, making it almost state-of-the-art. However, it does decrease clean-set accuracy by a few percent, despite using a deeper network than Madry et al. Their technique has many similarities to Adversarial Logit Pairing, which is not cited, because they encourage the network to embed a clean example and an adversarial perturbation of a clean example similarly. I now describe Adversarial Logit Pairing. During training, ALP teaches the network to classify clean and adversarially perturbed points; added to that loss is an l_2 loss between the logit embeddings of clean examples and the logits of the corresponding adversarial examples. In contrast, in place of the l_2 loss from ALP, this paper uses the KL divergence from the softmax of the clean example to the softmax of an adversarial example. Yet the softmax distributions are given a high temperature, so this loss is not much different from an l_2 loss between logits. The other main change in this paper is that adversarial examples are generated by trying to maximize the aforementioned KL divergence between clean and adversarial pairs, not by trying to maximize the classification log loss as in ALP. This paper then shows that some further engineering to adversarial logit pairing can improve adversarial robustness on CIFAR-10.
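The objective described above can be sketched as follows (a minimal numpy illustration of the KL-regularized loss, not the authors' code; in practice the logits come from a network, the KL term is differentiated by a framework's autodiff, and the softmaxes are temperature-scaled):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, tiny=1e-12):
    """KL(p || q), summed over the class axis."""
    return np.sum(p * (np.log(p + tiny) - np.log(q + tiny)), axis=-1)

def trades_style_loss(logits_clean, logits_adv, labels, beta=6.0):
    """Cross-entropy on clean examples plus a KL term pulling the softmax
    of each adversarial example toward the softmax of its clean example;
    beta trades off accuracy against robustness."""
    p_clean = softmax(logits_clean)
    ce = -np.log(p_clean[np.arange(len(labels)), labels] + 1e-12)
    kl = kl_div(p_clean, softmax(logits_adv))
    return np.mean(ce + beta * kl)

rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 10))              # batch of 4, 10 classes
adv = clean + 0.1 * rng.normal(size=(4, 10))  # stand-in for a perturbation
labels = np.array([0, 1, 2, 3])
print(round(trades_style_loss(clean, adv, labels), 3))
```

The paper's other change is that the adversarial perturbation itself is found by ascending this same KL term, rather than the classification loss.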

Field building

The case for building expertise to work on US AI policy, and how to do it (Niel Bowerman): This in-depth career review makes the case for working on US AI policy. It starts by making a short case for why AI policy is important, and then argues that US AI policy roles in particular can be very impactful (though they would still recommend a policy position in an AI lab like DeepMind or OpenAI over a US AI policy role). It has tons of useful detail; the only reason I'm not summarizing it is because I suspect that most readers are not currently considering career choices, and if you are considering your career, you should be reading the entire article, not my summary. You could also check out Import AI's summary.

Miscellaneous (Alignment)

How does Gradient Descent Interact with Goodhart? (Scott Garrabrant): Summarized in the highlights!

Can there be an indescribable hellworld? (Stuart Armstrong) (summarized by Richard): This short post argues that it's always possible to explain why any given undesirable outcome doesn't satisfy our values (even if that explanation needs to be at a very high level), and so being able to make superintelligences debate in a trustworthy way is sufficient to make them safe.

AI strategy and policy

Bridging near- and long-term concerns about AI (Stephen Cave et al)

Surveying Safety-relevant AI Characteristics (Jose Hernandez-Orallo et al)

Other progress in AI

Reinforcement learning

Causal Reasoning from Meta-reinforcement Learning (Ishita Dasgupta et al)

Deep learning

Transformer-XL: Unleashing the Potential of Attention Models (Zihang Dai, Zhilin Yang et al): Summarized in the highlights!


PAI Fellowship Program Call For Applications: The Partnership on AI is opening applications for Research Fellows who will "conduct groundbreaking multi-disciplinary research".
