Alignment Newsletter #53

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

Cody Wild is now contributing summaries to the newsletter!


Alignment Newsletter One Year Retrospective (Rohin Shah): The Alignment Newsletter is one year old! I’ve written a retrospective of the newsletter’s impact over the last year, with a lot of open questions about what the newsletter should look like in the future. Please help me figure out the answers by taking this 3-minute survey, and if you’re feeling particularly generous with your time, read the retrospective and tell me your opinions in the comments!

Are Deep Neural Networks Dramatically Overfitted? (Lilian Weng): The concepts of underfitting and overfitting, and their relation to the bias-variance tradeoff, are fundamental to standard machine learning theory. Roughly, for a fixed amount of data, there is an optimal model complexity for learning from that data: any less complex and the model won’t be able to fit the data, and any more complex and it will overfit to noise in the data. This means that as you increase model complexity, training error will go down to zero, but validation error will go down and then start turning back up once the model is overfitting.
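A minimal pure-Python sketch of this classical picture (my own toy setup, not from the post): fit polynomials of increasing degree to noisy samples of sin(3x) by exact least squares, and watch training error fall monotonically while validation error need not.

```python
import math
import random

random.seed(0)

def features(x, degree):
    # Polynomial features [1, x, x^2, ..., x^degree].
    return [x ** d for d in range(degree + 1)]

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for the square system A c = b.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit(xs, ys, degree):
    # Least squares via the normal equations (X^T X) c = X^T y.
    X = [features(x, degree) for x in xs]
    k = degree + 1
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)

def mse(xs, ys, c):
    return sum((sum(ci * x ** i for i, ci in enumerate(c)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def sample(n):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [math.sin(3 * x) + random.gauss(0, 0.2) for x in xs]
    return xs, ys

train_x, train_y = sample(30)
val_x, val_y = sample(100)

train_errs, val_errs = {}, {}
for degree in range(1, 9):
    c = fit(train_x, train_y, degree)
    train_errs[degree] = mse(train_x, train_y, c)
    val_errs[degree] = mse(val_x, val_y, c)

for d in range(1, 9):
    # Training error is non-increasing in degree (nested model classes);
    # validation error is free to turn back up once the model overfits.
    print(d, round(train_errs[d], 3), round(val_errs[d], 3))
```

Because each degree's model class contains the previous one, exact least squares guarantees the training error can only decrease with complexity; the validation error is where the bias-variance tradeoff shows up.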

We know that neural networks are much more expressive than the theory would predict is optimal, both from theorems showing that neural networks can learn any function (including one that provides a rather tight bound on the number of parameters), as well as a paper showing that neural nets can learn random noise. Yet they work well in practice, achieving good within-distribution generalization.

The post starts with a brief summary of topics that readers of this newsletter are probably familiar with: Occam’s razor, the Minimum Description Length principle, Kolmogorov Complexity, and Solomonoff Induction. If you don’t know these, I strongly recommend learning them if you care about understanding within-distribution generalization. The post then looks at a few recent informative papers, and tries to reproduce them.

The first one is the most surprising: they find that as you increase the model complexity, your validation error goes down and then back up, as expected, but then at some point it enters a new regime and goes down again. However, the author notes that you have to set up the experiments just right to get the smooth curves the paper got, and her own attempts at reproducing the result are not nearly as dramatic.

Another paper measures the difficulty of a task based on its “intrinsic dimension”, which Cody has summarized separately in this newsletter.

The last paper looks at what happens if you (a) reset some layer’s parameters to the initial parameters and (b) randomize some layer’s parameters. They find that randomizing always destroys performance, but resetting to initial parameters doesn’t make much of a difference for later layers, while being bad for earlier layers. This was easy to reproduce, and the findings reemerge very clearly.

Rohin’s opinion: I’m very interested in this problem, and this post does a great job of introducing it and summarizing some of the recent work. I especially appreciated the attempts at reproducing the results.

On the papers themselves, a regime where you already have ~zero training error but validation error goes down as you increase model expressivity is exceedingly strange. Skimming the paper, it seems that the idea is that in the normal ML regime, you are only minimizing training error—but once you can get the training error to zero, you can then optimize for the “simplest” model with zero training error, which by Occam’s-Razor-style arguments should be the best one and lead to better validation performance. This makes sense in the theoretical model that they use, but it’s not clear to me how this applies to neural nets, where you aren’t explicitly optimizing for simplicity after getting zero training error. (Techniques like regularization don’t result in one-after-the-other optimization—you’re optimizing for both simplicity and low training error simultaneously, so you wouldn’t expect this critical point at which you enter a new regime.) So I still don’t understand these results. That said, given the difficulty of reproducing them, I’m not going to put too much weight on these results for now.

I tried to predict the results of the last paper and correctly predicted that randomizing would always destroy performance, but predicted that resetting to initialization would be okay for early layers instead of later layers. I had a couple of reasons for the wrong prediction. First, there had been a few papers that showed good results even with random features, suggesting the initial layers aren’t too important, and so maybe don’t get updated too much. Second, the gradient of the loss w.r.t. later layers requires only a few backpropagation steps, and so probably provides a clear, consistent direction moving them far away from the initial configuration, while the gradient w.r.t. earlier layers factors through the later layers, which may have weird or wrong values, and so might push in an unusual direction that might get cancelled out across multiple gradient updates. I skimmed the paper and it doesn’t really speculate on why this happens, and my thoughts still seem reasonable to me, so this is another fact that I have yet to explain.

Technical AI alignment

Technical agendas and prioritization

Summary of the Technical Safety Workshop (David Krueger) (summarized by Richard): David identifies two broad types of AI safety work: human-in-the-loop approaches and theory approaches. A notable subset of the former category is methods which improve our ability to give advanced systems meaningful feedback—this includes debate, IDA, and recursive reward modeling. CIRL and CAIS are also human-in-the-loop. Meanwhile, the theory category includes MIRI’s work on agent foundations, side effect metrics, and verified boxing.

Iterated amplification

A Concrete Proposal for Adversarial IDA (Evan Hubinger): This post presents a method to use an adversary to improve the sample efficiency (with respect to human feedback) of iterated amplification. The key idea is that when a question is decomposed into subquestions, the adversary is used to predict which subquestion the agent will do poorly on, and the human is only asked to resolve that subquestion. In addition to improving sample efficiency by only asking relevant questions, the resulting adversary can also be used for interpretability: for any question-answer pair, the adversary can pick out specific subquestions in the tree that are particularly likely to contain errors, which can then be reviewed.
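The control flow of the key idea can be sketched in a few lines. Everything here is a made-up toy (the arithmetic task, and the names `decompose`, `agent_answer`, `adversary_score`, `human_answer` stand in for components the post leaves abstract); the point is only that the human answers just the one subquestion the adversary flags.

```python
def decompose(question):
    # Toy decomposition: split an arithmetic question into its terms,
    # e.g. "2+7+15" -> ["2", "7", "15"].
    return question.split("+")

def agent_answer(subquestion):
    # An imperfect agent: handles small numbers, silently fails on big ones.
    n = int(subquestion)
    return n if n < 10 else 0

def adversary_score(subquestion):
    # The adversary predicts how likely the agent is to err on each
    # subquestion; here it has learned that longer numbers are riskier.
    return len(subquestion)

def human_answer(subquestion):
    # The human oracle is correct but expensive; we query it only once.
    return int(subquestion)

def answer_with_adversary(question):
    subs = decompose(question)
    # Ask the human only about the subquestion the adversary flags.
    worst = max(subs, key=adversary_score)
    results = [human_answer(s) if s == worst else agent_answer(s) for s in subs]
    return sum(results)

print(answer_with_adversary("2+7+15"))  # 24; the agent alone would get 9
```

The same `adversary_score` ranking is what gives the interpretability benefit: for any answered question you can sort subquestions by predicted error and review the top ones.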

Rohin’s opinion: I like the idea, but the math in the post is quite hard to read (mainly due to the lack of exposition). The post also has separate procedures for amplification, distillation and iteration; I think they can be collapsed into a single more efficient procedure, which I wrote about in this comment.

Learning human intent

Conditional revealed preference (Jessica Taylor): When backing out preferences by looking at people’s actions, you may find that even though they say they are optimizing for X, their actions are better explained as optimizing for Y. This is better than relying on what they say, at least if you want to predict what they will do in the future. However, all such inferences are specific to the current context. For example, you may infer that schools are “about” dealing with authoritarian work environments, as opposed to learning—but maybe this is because everyone who designs schools doesn’t realize what the most effective methods of teaching-for-learning are, and if they were convinced that some other method was better for learning they would switch to that. So, in order to figure out what people “really want”, we need to see not only what they do in the current context, but also what they would do in a range of alternative scenarios.

Rohin’s opinion: The general point here, which comes up pretty often, is that any information you get about “what humans want” is going to be specific to the context in which you elicit that information. This post makes that point when the information you get is the actions that people take. Some other instances of this point:

- Inverse Reward Design notes that a human-provided reward function should be treated as specific to the training environment, instead of as a description of good behavior in all possible environments.

- CP-Nets are based on the point that when a human says “I want X” it is not a statement that is meant to hold in all possible contexts. They propose very weak semantics, where “I want X” means “holding every other aspect of the world constant, it would be better for X to be present than for it not to be present”.

- Wei Dai’s point (AN #37) that humans likely have adversarial examples, and we should not expect preferences to generalize under distribution shift.

- Stuart Armstrong and Paul Christiano have made or addressed this point in many of their posts.
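The ceteris paribus (“all else equal”) semantics in the CP-Nets bullet above can be written down directly. The dict-of-attributes representation is my own illustration, not CP-net notation: under the statement “I want X”, outcome a is preferred to outcome b only if a has X, b lacks X, and they agree on everything else.

```python
def prefers(a, b, want):
    # `a` and `b` are outcomes: dicts mapping attribute -> bool.
    # `want` is the attribute the human asked for. The statement is
    # silent about any pair of outcomes that differ elsewhere.
    others_equal = all(a[k] == b[k] for k in a if k != want)
    return a[want] and not b[want] and others_equal

office    = {"coffee": True,  "window_open": False}
no_coffee = {"coffee": False, "window_open": False}
different = {"coffee": True,  "window_open": True}

print(prefers(office, no_coffee, "coffee"))     # True: only coffee differs
print(prefers(different, no_coffee, "coffee"))  # False: the context changed too
```

This is exactly why the semantics are “very weak”: the statement compares almost no pairs of outcomes, which is the price of not overgeneralizing across contexts.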

Defeating Goodhart and the closest unblocked strategy problem (Stuart Armstrong): One issue with the idea of reward uncertainty (AN #42) based on a model of uncertainty that we specify is that we tend to severely underestimate how uncertain we should be. This post makes the point that we could try to build an AI system that starts with this estimate of our uncertainty, but then corrects the estimate based on its understanding of humans. For example, if it notices that humans tend to become much more uncertain when presented with some crucial consideration, it could realize that its estimate probably needs to be widened significantly.
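One hypothetical shape such a correction could take (my own toy, not from the post): the system keeps an interval estimate of a reward, and whenever it observes a human becoming *more* uncertain after learning something, it widens its own interval instead of narrowing it. Note this is deliberately not a Bayesian update, which could only concentrate the estimate over time.

```python
def widen(interval, human_spread_before, human_spread_after, factor=1.5):
    lo, hi = interval
    if human_spread_after > human_spread_before:
        # Evidence that we underestimated our uncertainty:
        # widen the interval around its center.
        center, half = (lo + hi) / 2, (hi - lo) / 2
        half *= factor
        return (center - half, center + half)
    return interval

estimate = (0.4, 0.6)
# The human's stated uncertainty jumped from 0.1 to 0.3 after hearing a
# crucial consideration, so the system widens its own estimate.
estimate = widen(estimate, 0.1, 0.3)
print(estimate)  # roughly (0.35, 0.65)
```

The widening factor and the trigger condition are both arbitrary here, which is part of Rohin's worry below: some explicit model of how to update the uncertainty has to be specified, and it can itself be misspecified.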

Rohin’s opinion: So far, this is an idea that hasn’t been turned into a proposal yet, so it’s hard to evaluate. The most obvious implementation (to me) would involve an explicit estimate of reward uncertainty, and then an explicit model for how to update that uncertainty (that would not be Bayes Rule, since that would narrow the uncertainty over time). At this point it’s not clear to me why we’re even using the expected utility formalism; it feels like adding epicycles in order to get a single particular behavior, in a way that breaks other things. You could also make the argument that there will be misspecification of the model of how to update the uncertainty. But again, this is just the most obvious completion of the idea; it’s plausible that there’s a different way of doing this that’s better.

Parenting: Safe Reinforcement Learning from Human Input (Christopher Frye et al)


Attention is not Explanation (Sarthak Jain et al) (summarized by Richard): This paper explores the usefulness of attention weights in interpreting neural networks’ performance on NLP tasks. The authors present two findings: firstly, that attention weights are only weakly correlated with other metrics of word importance; and secondly, that there often exist adversarially-generated attention weights which are totally different from the learned weights, but which still lead to the same outputs. They conclude that these results undermine the explanatory relevance of attention weights.
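The first finding is a rank-correlation comparison. A minimal sketch of that kind of check, with made-up per-word scores standing in for real attention weights and gradient-based saliency (Kendall's tau implemented directly; ties ignored):

```python
def kendall_tau(xs, ys):
    # Fraction of concordant minus discordant pairs over all pairs.
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

attention = [0.50, 0.20, 0.15, 0.10, 0.05]  # per-word attention weights (made up)
saliency  = [0.10, 0.40, 0.05, 0.30, 0.15]  # per-word gradient saliency (made up)

tau = kendall_tau(attention, saliency)
print(tau)  # 0.0: the two importance rankings are uncorrelated
```

A tau near zero for a real example is the kind of evidence the paper uses to argue that attention weights and other importance metrics disagree.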

Richard’s opinion: I like this type of investigation, but don’t find their actual conclusions compelling. In particular, it doesn’t matter whether “meaningless” adversarial attention weights can lead to the same classifications, as long as the ones actually learned by the system are interpretable. Also, the lack of correlation between attention weights and other methods could be explained either by attention weights being much worse than the other methods, or much better, or merely useful for different purposes.

Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations (Andrew Ross et al)

Adversarial examples

The LogBarrier adversarial attack: making effective use of decision boundary information (Chris Finlay et al) (summarized by Dan H): Rather than maximizing the loss of a model given a perturbation budget, this paper minimizes the perturbation size subject to the constraint that the model misclassify the example. This misclassification constraint is enforced by adding a logarithmic barrier to the objective, which they prevent from causing a loss explosion through a few clever tricks. Their attack appears to be faster than the Carlini-Wagner attack.

Read more: The code is here.
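A toy, pure-Python rendition of the barrier objective on a linear classifier f(p) = w · p (predict class 1 when f > 0). The model, starting point, step sizes, clipping, and annealing schedule are all made up for illustration; the real attack runs on deep networks and uses further tricks to keep the barrier stable.

```python
import math

# Minimize ||delta||^2 - mu * log(-f(x + delta)): the perturbation size plus
# a log barrier that keeps x + delta misclassified (f < 0), annealing mu -> 0.

w = [2.0, -1.0]
x = [1.0, 0.5]  # correctly classified: f(x) = 1.5 > 0

def f(p):
    return sum(wi * pi for wi, pi in zip(w, p))

# Start from a feasible (already misclassified) point, far along -w.
delta = [-2.0 * wi for wi in w]

mu = 1.0
for _ in range(2000):
    margin = f([xi + di for xi, di in zip(x, delta)])  # stays < 0
    # Gradient of ||delta||^2 - mu * log(-margin) with respect to delta.
    grad = [2 * di - mu * wi / margin for di, wi in zip(delta, w)]
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > 1.0:  # clip so the barrier can't fling the iterate away
        grad = [g / norm for g in grad]
    step = [di - 0.01 * gi for di, gi in zip(delta, grad)]
    if f([xi + si for xi, si in zip(x, step)]) < 0:  # reject infeasible steps
        delta = step
    mu *= 0.997  # anneal the barrier toward the hard constraint

adversarial = [xi + di for xi, di in zip(x, delta)]
size = math.sqrt(sum(di * di for di in delta))
print(round(size, 3), f(adversarial) < 0)
```

For this linear toy, the smallest misclassifying perturbation is the projection of x onto the decision boundary (size about 0.67 here), and annealing the barrier drives the iterate toward it while the feasibility check keeps the example misclassified throughout.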


Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks (Mingchen Li et al.) (summarized by Dan H): Previous empirical papers have shown that finding ways to decrease training time greatly improves robustness to label corruptions, but to my knowledge this is the first theoretical treatment.

Other progress in AI

Deep learning

Measuring the Intrinsic Dimension of Objective Landscapes (Chunyuan Li et al) (summarized by Cody): This paper proposes and defines a quantity called “intrinsic dimension”, a geometrically-informed metric of how many degrees of freedom are actually needed to train a given model on a given dataset. They calculate this by picking a set of random directions that span some subspace of dimension d, and taking gradient steps only along that lower-dimensional subspace. They consider the intrinsic dimension of a model and a dataset to be the smallest value d at which performance reaches 90% of a baseline, normally trained model on the dataset. The geometric intuition of this approach is that the dimensionality of parameter space can be, by definition, split into intrinsic dimension and its codimension, the dimension of the solution set. In this framing, higher solution set dimension (and lower intrinsic dimension) corresponds to proportionally more of the search space containing reasonable solution points, and therefore a situation where a learning agent will be more likely to find such a solution point. There are some interesting observations here that correspond with our intuitions about model trainability: on MNIST, intrinsic dimensionality for a CNN is lower than for a fully connected network, but if you randomize pixel locations, the CNN’s intrinsic dimension shoots up above the fully connected network’s, matching the intuition that CNNs are appropriate when their assumption of local structure holds.
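The subspace-training procedure itself is simple to sketch. Below is a toy, pure-Python version where a fixed quadratic “loss” stands in for a real network's training loss: we optimize theta = theta0 + P z for z in a random d-dimensional subspace, and compare a d that is too small against one that spans the full space.

```python
import math
import random

random.seed(0)

D = 20               # full parameter dimension (tiny, for illustration)
target = [1.0] * D   # the "solution" the stand-in loss wants to reach

def loss(theta):
    # Stand-in loss: squared distance to a fixed solution vector.
    return sum((t - g) ** 2 for t, g in zip(theta, target))

def train_in_subspace(d, steps=2000, lr=0.05):
    # Fixed random projection P (D x d); only z (length d) is trained.
    P = [[random.gauss(0, 1) / math.sqrt(d) for _ in range(d)] for _ in range(D)]
    theta0 = [0.0] * D
    z = [0.0] * d
    for _ in range(steps):
        theta = [theta0[i] + sum(P[i][k] * z[k] for k in range(d)) for i in range(D)]
        # Gradient of the loss w.r.t. z, via the chain rule through P.
        err = [2 * (theta[i] - target[i]) for i in range(D)]
        grad_z = [sum(P[i][k] * err[i] for i in range(D)) for k in range(d)]
        z = [zk - lr * gk for zk, gk in zip(z, grad_z)]
    theta = [theta0[i] + sum(P[i][k] * z[k] for k in range(d)) for i in range(D)]
    return loss(theta)

baseline = loss([0.0] * D)      # loss at initialization
low_d = train_in_subspace(5)    # too few degrees of freedom: stuck far away
high_d = train_in_subspace(20)  # enough dimensions to span the whole space
print(round(baseline, 1), round(low_d, 2), round(high_d, 2))
```

With too few random directions the subspace almost surely misses the solution set, so the loss plateaus well above zero; once d is large enough, training in the subspace recovers (nearly) the full solution. The paper's intrinsic dimension is the smallest d clearing its 90%-of-baseline bar.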

Cody’s opinion: Overall, I find this an interesting and well-articulated paper, and am curious to see future work that addresses some of the extrapolations and claims implied by this paper, particularly their claim, surprising relative to my intuitions, that increasing n_parameters will, maybe monotonically, reduce the difficulty of training, because it simply increases the dimensionality of the solution set. I’m also not sure how to feel about their simply asserting that a solution exists when a network reaches 90% of baseline performance, since we may care about that “last mile” of performance, and it might also be the hardest to reach.

Read more: Paper
