The Goodhart Game

In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security.

From the abstract of Motivating the Rules of the Game for Adversarial Example Research by Gilmer et al. (summary)

Adversarial examples have been great for getting more ML researchers to pay attention to alignment considerations. I personally have spent a fair amount of time thinking about adversarial examples, I think the topic is fascinating, and I’ve had a number of ideas for addressing them. But I’m also not actually sure working on adversarial examples is a good use of time. Why?

Like Gilmer et al., I think adversarial examples are undermotivated… and overrated. People in the alignment community like to make an analogy between adversarial examples and Goodhart’s Law, but I think this analogy fails to be more than an intuition pump. With Goodhart’s Law, there is no “adversary” attempting to select an input that the AI does particularly poorly on. Instead, the AI itself is selecting an input in order to maximize something. Could the input the AI selects be an input that the AI does poorly on? Sure. But I don’t think the commonality goes much deeper than “there are parts of the input space that the AI does poorly on”. In other words, classification error is still a thing. (Maybe both adversaries and optimization tend to push us off the part of the distribution our model performs well on. OK, distributional shift is still a thing.)

To repeat a point made by the authors, if your model has any classification error at all, it’s theoretically vulnerable to adversaries. Suppose you have a model that’s 99% accurate and I have an uncorrelated model that’s 99.9% accurate. Suppose I have access to your model. Then I can search the input space for a case where your model and mine disagree. Since my model is more accurate, ~10 times out of 11 the input will correspond to an “adversarial” attack on your model. From a philosophical perspective, solving adversarial examples appears to be essentially equivalent to getting 100% accuracy on every problem. In the limit, addressing adversarial examples in a fully satisfactory way looks a bit like solving AGI.
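To spell out that arithmetic, here is a back-of-the-envelope check, assuming the two models’ errors really are independent:

```python
# Back-of-the-envelope check of the "~10 times out of 11" figure,
# assuming the two models' errors are independent.
p_yours_wrong = 0.01    # your model: 99% accurate
p_mine_wrong = 0.001    # my model: 99.9% accurate

# The models disagree exactly when one is right and the other is wrong.
p_you_wrong_me_right = p_yours_wrong * (1 - p_mine_wrong)   # ~0.00999
p_me_wrong_you_right = p_mine_wrong * (1 - p_yours_wrong)   # ~0.00099

fraction_adversarial = p_you_wrong_me_right / (
    p_you_wrong_me_right + p_me_wrong_you_right
)
print(fraction_adversarial)  # ~0.91, i.e. roughly 10 out of 11 disagreements
```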

At the same time, metrics have taken us a long way in AI research, whether the metric is the ability to withstand human-crafted adversarial examples or the ability to score well on ImageNet. So what would a metric that hits the AI alignment problem a little more squarely look like? How could we measure progress on solving Goodhart’s Law instead of a problem that’s vaguely analogous?

Let’s start simple. You submit an AI program. Your program gets some labeled data from a real-valued function to maximize (standing in for “labeled data about the operator’s true utility function”). It figures out where it thinks the maximum of the function is and makes its guess. Score is based on regret: the function’s true maximum minus the function value at the alleged maximum.
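Here’s a minimal sketch of what that scoring loop could look like (all names are hypothetical, and the “true” maximum is found by brute force over a finite candidate set):

```python
import numpy as np

def score_submission(submission, f, candidate_inputs, n_labeled=100, seed=0):
    """Run one round of the basic game and return the regret.

    `submission` is the contestant's program: it sees labeled (x, f(x)) pairs
    and returns its guess for the input that maximizes f.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(candidate_inputs), size=n_labeled, replace=False)
    labeled_x = candidate_inputs[idx]
    labeled_y = np.array([f(x) for x in labeled_x])

    guess = submission(labeled_x, labeled_y)

    true_max = max(f(x) for x in candidate_inputs)
    regret = true_max - f(guess)  # lower is better; 0 means a perfect guess
    return regret
```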

We can make things more interesting. Suppose the real-valued function has both positive and negative outputs. Suppose most outputs of the real-valued function are negative (in the same way most random actions a powerful AI system could take would be negative from our perspective). And the AI system gets the option to abstain from action, which yields a score of 0. Now there’s more of an incentive to find an input which is “acceptable” with high probability, and abstain if in doubt.
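A sketch of that variant (again with hypothetical names): the score is now the raw function value, with a sentinel for abstaining.

```python
# Abstention variant: the submission may return ABSTAIN instead of an input.
# Abstaining scores 0; committing to an input scores its (mostly negative)
# true function value.
ABSTAIN = object()

def score_with_abstention(submission, f, labeled_x, labeled_y):
    guess = submission(labeled_x, labeled_y)
    if guess is ABSTAIN:
        return 0.0
    return f(guess)
```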

Maybe the labeled data gets the true utility function wrong in important ways. We can add noise to the data somehow before passing it to our AI system to simulate this. Perhaps some outputs can be assigned completely at random.
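One possible noise model, as a sketch (the particular noise parameters here are made up):

```python
import numpy as np

def corrupt_labels(labeled_y, noise_std=0.1, p_random=0.05, seed=0):
    """Add Gaussian jitter to every label, then replace a small fraction of
    labels with uniformly random values, simulating data that gets the true
    utility function wrong in important ways."""
    rng = np.random.default_rng(seed)
    noisy = labeled_y + rng.normal(0.0, noise_std, size=labeled_y.shape)
    mask = rng.random(labeled_y.shape) < p_random
    noisy[mask] = rng.uniform(labeled_y.min(), labeled_y.max(), size=mask.sum())
    return noisy
```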

Even with noise, the best strategy might be to just select the input from the labeled data that produces the largest positive output. But this isn’t necessarily feasible for a real AGI. If labeled data corresponds to desired and undesired behavior for our AGI, it probably won’t work to just execute the “most desired” behavior from the labeled dataset, because that “most desired” behavior will be contingent on a particular set of circumstances.
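That baseline strategy is trivial to write down (reusing the hypothetical ABSTAIN sentinel from the earlier sketch):

```python
import numpy as np

ABSTAIN = object()  # same sentinel as in the abstention sketch above

def best_labeled_point(labeled_x, labeled_y):
    """Naive baseline: return the labeled input with the highest observed
    (possibly noisy) label, abstaining if nothing looks positive."""
    best = int(np.argmax(labeled_y))
    if labeled_y[best] <= 0:
        return ABSTAIN
    return labeled_x[best]
```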

To simulate this, we can add distributional shift to the challenge. Provide some labeled data, then specify a feasible region that may contain little or no labeled data. Ask the AI for the best input in the feasible region. There’s a straightforward analogy to selecting an action here, if you think of each input as a vector where info describing the circumstances is concatenated with info describing an action. We could choose a feasible region where the “circumstances” part of the vector is fixed, and there are a limited number of “actions” available to choose between.
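A sketch of the distributional-shift variant, where the guess must come from a feasible region that may share little with the labeled data:

```python
import numpy as np

ABSTAIN = object()

def score_in_feasible_region(submission, f, labeled_x, labeled_y, feasible_inputs):
    """The submission must pick its guess from `feasible_inputs` (or abstain).
    A feasible input here is a "circumstances" vector concatenated with an
    "action" vector, with the circumstances part held fixed."""
    guess = submission(labeled_x, labeled_y, feasible_inputs)
    if guess is ABSTAIN:
        return 0.0
    return f(guess)

# For example, fix the circumstances and enumerate a handful of candidate actions:
# feasible_inputs = [np.concatenate([circumstances, a]) for a in candidate_actions]
```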

If we restrict the feasible region to two points, and the two points are chosen adversarially by inspecting the AI’s model, the traditional adversarial examples scenario can be seen as an unusual special case of this framework. I think this illustrates how artificial the “adversarial examples” scenario is from a Goodhart’s Law perspective, because the universe is rarely perverse enough to deliberately restrict us to only two options, one of which has been specially chosen to deceive us. Although the broadness of the action space has been cited as a reason why Goodhart’s Law occurs, I suspect a broader action space actually makes Goodhart’s Law easier to overcome if you use it well—for example, by simply selecting the labeled input with the best output, as described above.

To simulate an AI making queries of its human operator, you could allow the system a limited number of queries to the function it is trying to maximize. However, we probably should not focus on this active learning scenario to the exclusion of all else—to prevent showing the operator info hazards, we’ll want a model for “what constitutes an info hazard”, and that model can’t easily be trained by active learning without showing the operator instances of info hazards! (Well, maybe if they were restricted to mild info hazards...)
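A sketch of how a query budget could be grafted onto the same setup (the budget of 10 is arbitrary):

```python
ABSTAIN = object()

def score_with_query_budget(submission, f, labeled_x, labeled_y,
                            feasible_inputs, budget=10):
    """Active-learning variant: the submission gets a `query` callback it may
    call at most `budget` times before committing to a guess (or abstaining)."""
    calls = 0

    def query(x):
        nonlocal calls
        if calls >= budget:
            raise RuntimeError("query budget exhausted")
        calls += 1
        return f(x)

    guess = submission(labeled_x, labeled_y, feasible_inputs, query)
    return 0.0 if guess is ABSTAIN else f(guess)
```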

Where does the function to maximize come from? I see two options: people functions and computer functions. For people functions, you could use taskers to evaluate the computer’s output. There’s already been work on generating cat pictures, which could be seen as an attempt to maximize the person function “how much does this image look like a cat”. But ideas from this post could still be applied to such a problem. For example, to add distributional shift, you could find a weird cat picture, then fix a bunch of the weirder pixels on it as the “feasible region”, leave the other pixels unassigned, and see if an AI system can recover a reasonable cat according to taskers. Can an AI generate a black cat after only having seen tawny cats? What other distributional constraints could be imposed?

For computer functions, you’d like to keep your method for generating the function secret, because otherwise contest participants can code their AI system so it has an inductive bias towards learning the kind of functions that you like to use. Also, for computer functions, you probably want to be realistic without being perverse. For example, you could have a parabolic function with a point discontinuity at the peak; that could fool an AI system that tries to fit a parabola to the data and guess the peak, but this sort of perversity seems a bit unlikely to show up in real-world scenarios (unless we think the function is likely to go “off distribution” in the region of its true maximum?). Finally, in the same way most random images are not cats, and most atom configurations are undesired by humans, most inputs to your computer function should probably get a negative score. But in the same way it’s easier for people to specify what they want than what they don’t want, you might want to imbalance your training dataset towards positive scores anyway.
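As one illustration of a “computer function” family with mostly negative outputs (purely hypothetical; a real contest would keep its generator secret, as argued above): a large negative baseline plus a few narrow positive bumps.

```python
import numpy as np

def make_secret_function(seed=0, dim=8, n_bumps=5):
    """A constant negative baseline plus a few narrow Gaussian bumps, so that
    most random inputs score negatively, the way most random actions would."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(0.0, 1.0, size=(n_bumps, dim))
    heights = rng.uniform(1.0, 5.0, size=n_bumps)
    widths = rng.uniform(0.1, 0.5, size=n_bumps)

    def f(x):
        sq_dists = np.sum((np.asarray(x) - centers) ** 2, axis=-1)
        bumps = heights * np.exp(-sq_dists / (2 * widths ** 2))
        return float(bumps.sum() - 1.0)  # baseline of -1

    return f
```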

To ensure high reliability, we’ll want means by which these problems can be generated en masse, to see if we can get the probability of e.g. proposing an input that gets a negative output well below 0.1%. Luckily, for any given function/dataset pair, it’s possible to generate a lot of problems just by challenging the AI on different feasible regions.
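A sketch of that reliability harness, reusing the hypothetical pieces above:

```python
ABSTAIN = object()

def estimate_failure_rate(submission, f, labeled_x, labeled_y,
                          sample_feasible_region, n_trials=10_000):
    """Re-challenge the same submission on many randomly sampled feasible
    regions and report how often it commits to an input whose true score is
    negative. We'd want this estimate well below 0.001 (0.1%)."""
    failures = 0
    for _ in range(n_trials):
        feasible_inputs = sample_feasible_region()
        guess = submission(labeled_x, labeled_y, feasible_inputs)
        if guess is not ABSTAIN and f(guess) < 0:
            failures += 1
    return failures / n_trials
```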

Anyway, I think work on this problem will be more applicable to real-world AI safety scenarios than adversarial examples, and it doesn’t seem to me that it reduces quite as directly to “solve AGI” as work on adversarial examples does.