[AN #62] Are adversarial examples caused by real but imperceptible features?

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by commenting on this post.

Audio version here (may not be up yet).


Call for contributors to the Alignment Newsletter (Rohin Shah): I’m looking for content creators and a publisher for this newsletter! Apply by September 6.

Adversarial Examples Are Not Bugs, They Are Features (Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom et al) (summarized by Rohin and Cody): Distill published a discussion of this paper. This highlights section will cover the full discussion; all of these summaries and opinions are meant to be read together.

Consider two possible explanations of adversarial examples. First, they could arise because the model “hallucinates” a signal that is not useful for classification, and becomes very sensitive to this feature. We could call these “bugs”, since they don’t generalize well. Second, they could be caused by features that do generalize to the test set, but can be modified by an adversarial perturbation. We could call these “non-robust features” (as opposed to “robust features”, which can’t be changed by an adversarial perturbation). The authors argue that at least some adversarial perturbations fall into the second category of being informative but sensitive features, based on two experiments.

If the “hallucination” explanation were true, the hallucinations would presumably be caused by the training process, the choice of architecture, the size of the dataset, but not by the type of data. So one thing to do would be to see if we can construct a dataset such that a model trained on that dataset is already robust, without adversarial training. The authors do this in the first experiment. They take an adversarially trained robust classifier, and create images whose features (final-layer activations of the robust classifier) match the features of some unmodified input. The generated images only have robust features because the original classifier was robust, and in fact models trained on this dataset are automatically robust.
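The feature-matching step can be sketched with a toy linear “feature extractor” standing in for the robust classifier’s final-layer activations. The matrix `W`, the dimensions, and the learning rate below are all illustrative assumptions, not from the paper’s code:

```python
import numpy as np

# Toy sketch of the robust-dataset construction: starting from noise, run
# gradient descent on the input until its features match those of a target
# image. Here the "feature extractor" is a fixed random linear map W, a
# stand-in for the robust classifier's penultimate-layer activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))            # hypothetical feature extractor
features = lambda x: W @ x

x_target = rng.normal(size=32)          # an unmodified input
x_gen = rng.normal(size=32)             # start the generated image from noise

lr = 0.01
for _ in range(2000):
    # gradient of 0.5 * ||features(x_gen) - features(x_target)||^2 w.r.t. x_gen
    grad = W.T @ (features(x_gen) - features(x_target))
    x_gen -= lr * grad

# x_gen now carries only the information the extractor responds to; in the
# paper, that information is exactly the robust classifier's robust features.
feature_gap = np.linalg.norm(features(x_gen) - features(x_target))
```

In the paper the extractor is a deep network, so this optimization is done by backpropagating through the model rather than with a closed-form gradient, but the structure of the procedure is the same.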

If the “non-robust features” explanation were true, then it should be possible for a model to learn on a dataset containing only non-robust features (which will look nonsensical to humans) and still generalize to a normal-looking test set. In the second experiment (henceforth WrongLabels), the authors construct such a dataset. Their hypothesis is that adversarial perturbations work by introducing non-robust features of the target class. So, to construct their dataset, they take an image x with original label y, adversarially perturb it towards some class y’ to get image x’, and then add (x’, y’) to their dataset (even though to a human x’ looks like class y). They have two versions of this: in RandLabels, the target class y’ is chosen randomly, whereas in DetLabels, y’ is chosen to be y + 1. For both datasets, if you train a new model on the dataset, you get good performance on the original test set, showing that the “non-robust features” do generalize.
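As a minimal sketch of the WrongLabels construction, here is the relabeling scheme with a linear classifier standing in for the trained model. The classifier `W`, the perturbation routine, and all hyperparameters are illustrative assumptions:

```python
import numpy as np

# Sketch of the WrongLabels dataset construction. A linear model stands in
# for the trained classifier; the perturbation budget and step counts are
# made up for illustration.
rng = np.random.default_rng(1)
n_classes, dim = 10, 64
W = rng.normal(size=(n_classes, dim))   # hypothetical trained classifier

def perturb_towards(x, target, eps=0.5, steps=10):
    """Gradient ascent on the target-class logit (norm constraints that a
    real attack would use are omitted for simplicity)."""
    x_adv = x.copy()
    for _ in range(steps):
        grad = W[target]                # d(logit_target)/dx for a linear model
        x_adv += (eps / steps) * grad / np.linalg.norm(grad)
    return x_adv

x = rng.normal(size=dim)
y = 3                                   # original (human-visible) label
y_rand = int(rng.integers(n_classes))   # RandLabels: random target class
y_det = (y + 1) % n_classes             # DetLabels: deterministic target y + 1

x_adv = perturb_towards(x, y_det)
# The relabeled pair (x_adv, y_det) goes into the new training set, even
# though to a human x_adv still looks like class y.
logits_before, logits_after = W @ x, W @ x_adv
```

The perturbation only moves the non-robust signal a little, but because the pair is relabeled with the target class, a model trained on many such pairs is forced to rely on exactly that signal.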

Rohin’s opinion: I buy this hypothesis. It’s a plausible explanation for brittleness towards adversarial noise (“because non-robust features are useful to reduce loss”), and why adversarial examples transfer across models (“because different models can learn the same non-robust features”). In fact, the paper shows that architectures that did worse in the WrongLabels experiment (and so presumably are bad at learning non-robust features) are also the ones to which adversarial examples transfer the least. I’ll leave the rest of my opinion to the opinions on the responses.

Read more: Paper and Author response

Response: Learning from Incorrectly Labeled Data (Eric Wallace): This response notes that all of the experiments are of the form: create a dataset D that is consistent with a model M; then, when you train a new model M’ on D you get the same properties as M. Thus, we can interpret these experiments as showing that model distillation can work even with data points that we would naively think of as “incorrectly labeled”. This is a more general phenomenon: we can take an MNIST model, select only the examples for which the top prediction is incorrect (labeled with these incorrect top predictions), and train a new model on that; the new model gets nontrivial performance on the original test set, even though it has never seen a “correctly labeled” example.

Rohin’s opinion: I definitely agree that these results can be thought of as a form of model distillation. I don’t think this detracts from the main point of the paper: the reason model distillation works even with incorrectly labeled data is probably because the data is labeled in such a way that it incentivizes the new model to pick out the same features that the old model was paying attention to.

Response: Robust Feature Leakage (Gabriel Goh): This response investigates whether the datasets in WrongLabels could have had robust features. Specifically, it checks whether a linear classifier over provably robust features trained on the WrongLabels dataset can get good accuracy on the original test set. This shouldn’t be possible since WrongLabels is meant to correlate only non-robust features with labels. It finds that you can get some accuracy with RandLabels, but you don’t get much accuracy with DetLabels.

The original authors can actually explain this: intuitively, you get accuracy with RandLabels because it’s less harmful to choose labels randomly than to choose them explicitly incorrectly. With random labels on unmodified inputs, robust features should be completely uncorrelated with the labels. However, with random labels followed by an adversarial perturbation towards the label, there can be some correlation, because the adversarial perturbation can add “a small amount” of the robust feature. However, in DetLabels, the labels are wrong, and so the robust features are negatively correlated with the assigned labels; while this correlation can be reduced by an adversarial perturbation, it can’t be reversed (otherwise the features wouldn’t be robust).
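This intuition can be checked numerically in a toy binary-label setting, where the robust feature is the true label plus noise and the perturbation shifts it a small amount toward the assigned label. All the numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 100_000, 0.3                 # eps: how much robust feature the attack adds

y = rng.choice([-1.0, 1.0], size=n)   # true labels
f = y + rng.normal(size=n)            # robust feature: correlated with true label

# RandLabels: random targets, then a perturbation that nudges the robust
# feature slightly toward the assigned label
y_rand = rng.choice([-1.0, 1.0], size=n)
f_rand = f + eps * y_rand

# DetLabels: targets are deterministically wrong; the perturbation can shrink
# the robust feature's (negative) correlation with the assigned label, but for
# eps < 1 it cannot reverse it
y_det = -y
f_det = f + eps * y_det

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
corr_rand = corr(f_rand, y_rand)      # slightly positive
corr_det = corr(f_det, y_det)         # still clearly negative
```

This reproduces the authors’ point in miniature: a robust-feature classifier gets some signal from RandLabels (small positive correlation) but is actively misled by DetLabels (negative correlation).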

Rohin’s opinion: The original authors’ explanation of these results is quite compelling; it seems correct to me.

Response: Adversarial Examples are Just Bugs, Too (Preetum Nakkiran): The main point of this response is that adversarial examples can be bugs too. In particular, if you construct adversarial examples that explicitly don’t transfer between models, and then run the WrongLabels experiment with such adversarial perturbations, then the resulting model doesn’t perform well on the original test set (and so it must not have learned non-robust features).

It also constructs a data distribution where every useful feature of the optimal classifier is guaranteed to be robust, and shows that we can still get adversarial examples with a typical model, showing that it is not just non-robust features that cause adversarial examples.

In their response, the authors clarify that they didn’t intend to claim that adversarial examples could not arise due to “bugs”, just that “bugs” were not the only explanation. In particular, they say that their main thesis is “adversarial examples will not just go away as we fix bugs in our models”, which is consistent with the point in this response.

Rohin’s opinion: Amusingly, I think I’m more bullish on the original paper’s claims than the authors themselves. It’s certainly true that adversarial examples can arise from “bugs”: if your model overfits to your data, then you should expect adversarial examples along the overfitted decision boundary. The dataset constructed in this response is a particularly clean example: the optimal classifier would have an accuracy of 90%, but the model is trained to accuracy 99.9%, which means it must be overfitting.

However, I claim that with large and varied datasets, neural nets are typically not in the regime where they overfit to the data, and the presence of “bugs” in the model will decrease. (You certainly can get a neural net to be “buggy”, e.g. by randomly labeling the data, but if you’re using real data with a natural task then I don’t expect it to happen to a significant degree.) Nonetheless, adversarial examples persist, because the features that models use are not the ones that humans use.

It’s also worth noting that this experiment strongly supports the hypothesis that adversarial examples transfer because they are real features that generalize to the test set.

Response: Adversarial Example Researchers Need to Expand What is Meant by ‘Robustness’ (Justin Gilmer et al): This response argues that the results in the original paper are simply a consequence of a generally accepted principle: “models lack robustness to distribution shift because they latch onto superficial correlations in the data”. This isn’t just about L_p norm ball adversarial perturbations: for example, one recent paper shows that if the model is only given access to high frequency features of images (which look uniformly grey to humans), it can still get above 50% accuracy. In fact, when we do adversarial training to become robust to L_p perturbations, then the model pays attention to different non-robust features and becomes more vulnerable to e.g. low-frequency fog corruption. The authors call for adversarial examples researchers to move beyond L_p perturbations and think about the many different ways models can be fragile, and to make them more robust to distributional shift.

Rohin’s opinion: I strongly agree with the worldview behind this response, and especially the principle they identified. I didn’t know this was a generally accepted principle, though of course I am not an expert on distributional robustness.

One thing to note is what is meant by “superficial correlation” here. It means a correlation that really does exist in the dataset, that really does generalize to the test set, but that doesn’t generalize out of distribution. A better term might be “fragile correlation”. All of the experiments so far have been looking at within-distribution generalization (aka generalization to the test set), and are showing that non-robust features do generalize within-distribution. This response is arguing that there are many such non-robust features that will generalize within-distribution but will not generalize under distributional shift, and we need to make our models robust to all of them, not just L_p adversarial perturbations.

Response: Two Examples of Useful, Non-Robust Features (Gabriel Goh): This response studies linear features, since we can analytically compute their usefulness and robustness. It plots the singular vectors of the data as features, and finds that such features are either robust and useful, or non-robust and not useful. However, you can get useful, non-robust features by ensembling or contamination (see response for details).

Response: Adversarially Robust Neural Style Transfer (Reiichiro Nakano): The original paper showed that adversarial examples don’t transfer well to VGG, and that VGG doesn’t tend to learn similar non-robust features as a ResNet. Separately, VGG works particularly well for style transfer. Perhaps since VGG doesn’t capture non-robust features as well, the results of style transfer look better to humans? This response and the author’s response investigate this hypothesis in more detail and find that it seems broadly supported, but there are still finicky details to be worked out.

Rohin’s opinion: This is an intriguing empirical fact. However, I don’t really buy the theoretical argument that style transfer works because it doesn’t use non-robust features, since I would typically expect that a model that doesn’t use L_p-fragile features would instead use features that are fragile or non-robust in some other way.

Technical AI alignment


Problems in AI Alignment that philosophers could potentially contribute to (Wei Dai): Exactly what it says. The post is short enough that I’m not going to summarize it; it would be as long as the original.

Iterated amplification

Delegating open-ended cognitive work (Andreas Stuhlmüller): This is the latest explanation of the approach Ought is experimenting with: Factored Evaluation (in contrast to Factored Cognition (AN #36)). With Factored Cognition, the idea was to recursively decompose a high-level task until you reach subtasks that can be directly solved. Factored Evaluation still does recursive decomposition, but now it is aimed at evaluating the work of experts, along the same lines as recursive reward modeling (AN #34).

This shift means that Ought is attacking a very natural problem: how to effectively delegate work to experts while avoiding principal-agent problems. In particular, we want to design incentives such that untrusted experts under the incentives will be as helpful as experts intrinsically motivated to help. The experts could be human experts or advanced ML systems; ideally our incentive design would work for both.

Currently, Ought is running experiments with reading comprehension on Wikipedia articles. The experts get access to the article while the judge does not, but the judge can check whether particular quotes come from the article. They would like to move to tasks that have a greater gap between the experts and the judge (e.g. allowing the experts to use Google), and to tasks that are more subjective (e.g. whether the judge should get Lasik surgery).

Rohin’s opinion: The switch from Factored Cognition to Factored Evaluation is interesting. While it does make it more relevant outside the context of AI alignment (since principal-agent problems abound outside of AI), it still seems like the major impact of Ought is on AI alignment, and I’m not sure what the difference is there. In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal. The switch would be useful if you expect the reinforcement learning to work significantly better than imitation learning.

However, with Factored Evaluation, the agent that you train iteratively is one that must be good at evaluating tasks, and then you’d need another agent that actually performs the task (or you could train the same agent to do both). In contrast, with Factored Cognition you only need an agent that is performing the task. If the decompositions needed to perform the task are different from the decompositions needed to evaluate the task, then Factored Cognition would presumably have an advantage.

Miscellaneous (Alignment)

Clarifying some key hypotheses in AI alignment (Ben Cottier et al): This post (that I contributed to) introduces a diagram that maps out important and controversial hypotheses for AI alignment. The goal is to help researchers identify and more productively discuss their disagreements.

Near-term concerns

Privacy and security

Evaluating and Testing Unintended Memorization in Neural Networks (Nicholas Carlini et al)

Read more: The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks

Machine ethics

Towards Empathic Deep Q-Learning (Bart Bussmann et al): This paper introduces the empathic DQN, which is inspired by the golden rule: “Do unto others as you would have them do unto you”. Given a specified reward, the empathic DQN optimizes for a weighted combination of the specified reward, and the reward that other agents in the environment would get if they were a copy of the agent. They show that this results in resource sharing (when there are diminishing returns to resources) and avoiding conflict in two toy gridworlds.
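The reward combination described above can be written down directly; the function name and the weight `beta` below are my own illustrative choices, not from the paper's code:

```python
# Sketch of the empathic reward: a weighted mix of the agent's own specified
# reward and the reward another agent would receive if it were a copy of us.
def empathic_reward(own_reward: float, other_reward_if_copy: float,
                    beta: float = 0.5) -> float:
    """beta = 0 recovers the purely selfish agent; beta = 1 is purely empathic."""
    return (1 - beta) * own_reward + beta * other_reward_if_copy
```

For example, a resource grab that yields +1 for the agent but -1 for a copy of itself evaluates to `empathic_reward(1.0, -1.0, beta=0.5) == 0.0`, which is why resource sharing becomes competitive with grabbing under diminishing returns.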

Rohin’s opinion: This seems similar in spirit to impact regularization methods: the hope is that this is a simple rule that prevents catastrophic outcomes without having to solve all of human values.

AI strategy and policy

AI Algorithms Need FDA-Style Drug Trials (Olaf J. Groth et al)

Other progress in AI

Critiques (AI)

Evidence against current methods leading to human level artificial intelligence (Asya Bergal and Robert Long): This post briefly lists arguments that current AI techniques will not lead to high-level machine intelligence (HLMI), without taking a stance on how strong these arguments are.


Ought: why it matters and ways to help (Paul Christiano): This post discusses the work that Ought is doing, and makes a case that it is important for AI alignment (see the summary for Delegating open-ended cognitive work above). Readers can help Ought by applying for their web developer role, by participating in their experiments, and by donating.

Project Proposal: Considerations for trading off capabilities and safety impacts of AI research (David Krueger): This post calls for a thorough and systematic evaluation of whether AI safety researchers should worry about the impact of their work on capabilities.