Techniques for optimizing worst-case performance

If powerful ML systems fail catastrophically, they may be able to quickly cause irreversible damage. To be safe, it’s not enough to have an average-case performance guarantee on the training distribution — we need to ensure that even if our systems fail on new distributions or with small probability, they will never fail too badly.

The difficulty of optimizing worst-case performance is one of the most likely reasons that I think prosaic AI alignment might turn out to be impossible (if combined with an unlucky empirical situation).

In this post I want to explain my view of the problem and enumerate some possible angles of attack. My goal is to communicate why I have hope that worst-case guarantees are achievable.

None of these are novel proposals. The intention of this post is to explain my view, not to make a new contribution. I don’t currently work in any of these areas, and so this post should be understood as an outsider looking in, rather than coming from the trenches.

Malign vs. benign failures and corrigibility

I want to distinguish two kinds of failures:

  • “Benign” failures, where our system encounters a novel situation, doesn’t know how to handle it, and so performs poorly. The resulting behavior may simply be erratic, or may serve an external attacker. Their effect is similar to physical or cybersecurity vulnerabilities — they create an opportunity for destructive conflict but don’t systematically disfavor human values. They may pose an existential risk when combined with high-stakes situations, in the same way that human incompetence may pose an existential risk. Although these failures are important, I don’t think it is necessary or possible to eliminate them in the worst case.

  • “Malign” failures, where our system continues to behave competently but applies its intelligence in the service of an unintended goal. These failures systematically favor whatever goals AI systems tend to pursue in failure scenarios, at the expense of human values. They constitute an existential risk independent of any other destructive technology or dangerous situation. Fortunately, they seem both less likely and potentially possible to avoid even in the worst case.

I’m most interested in malign failures, and the narrower focus is important to my optimism.

The distinction between malign and benign failures is not always crisp. For example, suppose we try to predict a human’s preferences, then search over all strategies to find the one that best satisfies the predicted preferences. Guessing the preferences even a little bit wrong would create an adversarial optimizer incentivized to apply its intelligence to a purpose at odds with our real preferences. If we take this approach, incompetence does systematically disfavor human values.

By aiming for corrigible rather than optimal behavior (see here or here) I’m optimistic that it is possible to create a sharper distinction between benign and malign failures, which can be leveraged by the techniques below. But for now, this hope is highly speculative.


I believe that these techniques are much more likely to work if we have access to an overseer who is significantly smarter than the model that we are trying to train. I hope that amplification makes this possible.

It seems realistic for a strong overseer to recognize an (input, output) pair as a malign failure mode (though it may require a solution to informed oversight). So now we have a concrete goal: find a model that never gives an output the overseer would diagnose as catastrophically bad.

Historically, researchers in the AI safety community have been extremely pessimistic about reliability. I think part of that pessimism is because they have been imagining working with models much smarter than the overseer.


I’ll describe three categories of techniques:

  • Adversarial training

  • Verification

  • Transparency

Previous versions of this list included implicit ensembles, e.g. Bayesian neural nets. I still think there might be useful techniques in that space, but I’ve become pessimistic enough that it no longer makes the list.

Adversarial training

(Examples: Explaining and Harnessing Adversarial Examples, Adversarial Machine Learning at Scale.)

If we want to know whether our model ever fails catastrophically, it’s natural to train an adversary to search for catastrophe-inducing inputs. If the adversary can’t find one, then we hope that none exist. In order to produce a catastrophe-free model, we can integrate the results of testing into the training process.

The basic problem with this strategy is that no matter how hard you test, it’s always possible (realistically, it’s likely) that there are possible failure cases your adversary didn’t notice. For example, suppose that it takes 2ᴺ seconds to find the Nth failure. Because nature will be many times more complex than our training procedure, it’s inevitable that there are many types of failures that nature can find but we can’t.
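The basic loop can be sketched on a toy problem. This is a minimal illustration, not anything from the post: a logistic "policy" stands in for the model, the adversary does gradient ascent on the unsafe-action probability within an L-inf ball, and all hyperparameters are arbitrary assumptions.

```python
import numpy as np

# Toy sketch of adversarial training: an adversary searches near each
# training input for a point the model handles badly, and the model is
# trained on the adversary's findings. Model and budgets are illustrative.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversary(w, x0, steps=25, lr=0.1, eps=0.3):
    """Gradient ascent on the unsafe-action probability within an
    L-inf ball of radius eps around x0."""
    x = x0.copy()
    for _ in range(steps):
        p = sigmoid(x @ w)
        grad = p * (1.0 - p) * w                  # d p / d x
        x = np.clip(x + lr * np.sign(grad), x0 - eps, x0 + eps)
    return x

def train(X, y, adversarial=True, epochs=60, lr=0.5):
    """Logistic regression; optionally trains on the adversary's worst
    nearby inputs instead of the clean ones."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        Xb = np.stack([adversary(w, x) for x in X]) if adversarial else X
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

# Toy data: label 1 marks inputs where the "unsafe" action is correct.
X = rng.normal(size=(64, 2))
y = (X @ np.array([1.5, -1.0]) > 0).astype(float)
w = train(X, y)
```

Testing the finished model is the same `adversary` call: if it pushes the unsafe-action probability above your catastrophe threshold, it has found a failure; as the text notes, failing to find one is only weak evidence that none exists.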

We can attack this problem by trying to make the adversary’s job easier. Some possibilities:

  • Use interpretability techniques (see below) to help the adversary identify possible failures.

  • Use interpretability to expand what counts as catastrophic — if the adversary can get the model to think in a way that looks like it could lead to catastrophic behavior under different situations, that could count as a success for the adversary.

  • Combine with verification (see below) to help guide the adversary’s search and expand the space of allowed attacks.

  • Allow the adversary to directly control some of the internal computation of our model rather than merely controlling inputs. This requires a way to constrain the actions of the adversary — we want them to be able to produce any failure that could arise naturally, but if we give them too much power then they can trivially cause a catastrophic failure.

  • Change the system’s inputs so they are easier for the adversary to fake: see the “distorting observations” and “abstracting observations” sections of this post.

  • Make the adversary more powerful, without changing their task, by applying amplification.
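One of the options above, letting the adversary control internal computation directly, can be sketched concretely. The random two-layer net below stands in for a trained model, and the L-inf bound on the activation perturbation is an illustrative stand-in for the hard problem of restricting the adversary to failures that could arise naturally.

```python
import numpy as np

# Sketch of an adversary that edits hidden activations instead of inputs.
# Weights are random stand-ins for a trained model; the eps bound is an
# illustrative constraint on the adversary's power.

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8,))

def score(x, delta=None):
    """Scalar "unsafe-action score"; delta perturbs the hidden layer."""
    h = np.maximum(0.0, x @ W1)
    if delta is not None:
        h = np.maximum(0.0, h + delta)      # activations stay non-negative
    return h @ W2

def activation_adversary(x, eps=0.5, steps=100, lr=0.05):
    """Gradient ascent on the score over hidden-layer perturbations,
    constrained to an L-inf ball of radius eps."""
    h0 = np.maximum(0.0, x @ W1)
    delta = np.zeros(8)
    for _ in range(steps):
        grad = ((h0 + delta) > 0) * W2      # d score / d delta through ReLU
        delta = np.clip(delta + lr * np.sign(grad), -eps, eps)
    return delta
```

The interesting design question is exactly the one the bullet raises: eps too small and the adversary misses natural failures, eps too large and it can force a "failure" no natural input would produce.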

Adversarial training faces other difficulties as well. For example, it might take a long time for adversarial training to converge, or training might never converge due to non-convexity. I think these are significantly less likely to be deal-breakers, but they are certainly important research problems if we actually want to apply adversarial training.


Verification

(Examples: Certified Defenses against Adversarial Examples, Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks.)

In verification the goal is to find a certificate that there is no input on which a system behaves badly.

Verification has (at least) two problems:

  • Complexity. It’s really, really hard to verify properties of a computationally interesting model.

  • Specification. It’s unclear what we should be trying to verify.

Handling complexity is definitely challenging. But if we are free to train the model in order to facilitate verification, and if we are only interested in certifying some “easy” property that the model satisfies with slack, then it’s not clearly doomed.
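One concrete instance of certifying an "easy" property with slack is interval bound propagation, in the spirit of the certified-defense work cited above. Below is a minimal sketch on a tiny random ReLU net; the weights, box size, and threshold are all illustrative assumptions.

```python
import numpy as np

# Interval bound propagation: push an input box [lo, hi] through a small
# ReLU network to get sound (if loose) bounds on the output. If even the
# loose upper bound on the "unsafe" score is below threshold, that is a
# certificate that no input in the box misbehaves. Weights are arbitrary.

rng = np.random.default_rng(2)
W1, b1 = 0.3 * rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = 0.3 * rng.normal(size=(8,)), 0.0

def forward(x):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def interval_bound(lo, hi):
    """Sound bounds on the output over the box [lo, hi]: pair each weight
    sign with the box corner that extremizes the pre-activation."""
    W1p, W1n = np.maximum(W1, 0), np.minimum(W1, 0)
    h_lo = np.maximum(0.0, lo @ W1p + hi @ W1n + b1)
    h_hi = np.maximum(0.0, hi @ W1p + lo @ W1n + b1)
    W2p, W2n = np.maximum(W2, 0), np.minimum(W2, 0)
    out_lo = h_lo @ W2p + h_hi @ W2n + b2
    out_hi = h_hi @ W2p + h_lo @ W2n + b2
    return out_lo, out_hi

def certify_safe(x, eps, threshold):
    """Certificate that the score stays below threshold on the whole box."""
    _, hi = interval_bound(x - eps, x + eps)
    return hi < threshold
```

The slack matters: the bounds are loose, so only a model that satisfies the property with room to spare will be certifiable, which is why training the model with verification in mind helps.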

Prima facie, specification looks more like an unfixable deal-breaker. In the rest of this section I’ll give three possible approaches for finding specifications. I think none of these is satisfactory on its own, but they leave me optimistic that verification can be useful despite the difficulty of specification.

My current view is that specification and complexity are both potentially surmountable difficulties, but that it’s likely their resolutions will result in partial solutions that need to be combined with other approaches.

Use a slow model of the overseer as a specification. If our overseer is a physical process, to actually get a specification we’d need to learn a model of the overseer. So we’ve just shifted the problem to the model of the overseer.

Fortunately, this move does potentially make the problem one step better. Because we don’t need to use the overseer model at runtime, we can afford to use a much slower model. So verification could distill a very slow reliable model into a fast reliable model.

Amplification. If our overseer is produced by amplifying some process H, and we have a trusted model of H, then we can obtain a trusted model of the overseer. So if we were able to form a trusted model of the first step of amplification, then we could iteratively use verification to construct trusted models at each subsequent step.

We are still left with the problem at the first step of amplification. But at this step we might be able to restrict attention to a small range of questions for which worst-case guarantees are more feasible; we may be able to train weaker models for which worst-case guarantees are more feasible; or we might be able to eliminate the human altogether and begin amplification from some explicit simple core of reasoning. Any of these seems plausible.

“Verification” without specification. I find it helpful to think of verification from the primal/dual perspective, which also makes the connection to adversarial training clearer.

In adversarial training, the adversary wants to find an input on which the system behaves poorly. The dual to the adversary is an explicit list of all inputs, showing that each one of them performs well. The dual is exponentially large, and (relatedly) the adversary’s problem is exponentially hard.

We can view verification in a given proof system as making the adversary’s job easier, by expanding the set of allowable “attacks.” If we do this carefully, we can make the dual certificate much smaller and make the adversary’s task much more tractable.

For example, if our proofs (dual solutions) are in first-order logic, the adversary’s task is finding any model of the axioms in which there exists an attack. (This is only different in the infinite case.) If our proofs are sum-of-squares proofs, the adversary’s task is finding a pseudodistribution over attacks. For certain linear programming proofs, the adversary’s goal is finding an attack on a particular extrapolation of the model. And so on.

On this perspective, we don’t necessarily need an actual specification. Instead, we can take some external catastrophe-checker, relax it to a catastrophe-checker over the adversary’s relaxed action space, and then try to implement an adversary against that catastrophe-checker. This would change the meaning of the dual, which would now have a subtle statistical interpretation rather than being a traditional certificate.


Transparency

(Examples: Feature Visualization, Deep Inside Convolutional Networks.)

While training a model, we are free to look inside and “see what it’s thinking.” If we could actually understand those thoughts, we might be able to more easily predict under what conditions it would fail malignantly.
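The simplest version of "looking inside" is feature visualization, as in the examples cited above: gradient ascent on the input to find what most strongly activates a chosen unit. Here is a minimal sketch; the single random layer is an illustrative stand-in for a trained network.

```python
import numpy as np

# Feature visualization sketch: synthesize the (norm-bounded) input that
# maximally activates one hidden unit of a toy one-layer ReLU net.

rng = np.random.default_rng(5)
W1 = rng.normal(size=(10, 8))

def unit_activation(x, unit):
    return np.maximum(0.0, x @ W1)[unit]

def visualize(unit, steps=200, lr=0.1):
    """Gradient ascent on the unit's pre-activation, projected back into
    the unit ball so the "image" stays bounded."""
    x = np.zeros(10)
    for _ in range(steps):
        x = x + lr * W1[:, unit]            # gradient of x @ W1[:, unit]
        x /= max(1.0, np.linalg.norm(x))    # project to the unit ball
    return x
```

For this one-layer toy the result is just the unit’s weight vector, normalized; for deep networks the same loop (with backprop and image-space regularizers) yields the familiar feature-visualization pictures.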

A (weird) example: suppose that human evolution was part of an alien project to train a mathematician, and some humans are extracted from our simulation and trained to solve alien math problems. If the aliens are only able to test the humans’ behavior, they might conclude that the humans always faithfully solve whatever problem they are posed — especially after a few generations of selection removed all the humans who didn’t do so. But if the aliens were able to look inside the humans’ heads they might discover that this behavior is extremely contingent — perhaps the humans are only solving problems because they believe they will die otherwise, and would behave very differently as soon as they saw an opportunity to escape.

That example is exotic and somewhat implausible, but I think that the same basic dynamic is likely to apply in much messier practical situations. If a model would fail malignantly under some conditions, I think there is some sense in which the model must be testing those conditions during training.

Malign vs. benign failures: I think the prospects for using transparency to detect malign failures in models trained by gradient descent are much better than the prospects for detecting benign failures, or for detecting failures in arbitrary models.

The key point is that a malign failure requires leveraging the intelligence of the model to do something actively bad. If our model is trained by gradient descent, its behavior can only be intelligent when it is exercised on the training distribution — if part of the model never (or very rarely) does anything on the training distribution, then that part of the model can’t be intelligent. So in some sense a malign failure mode needs to use a code path that gets run on the training distribution, just under different conditions that cause it to behave badly.

That means that we don’t necessarily need to analyze our model in the abstract; we can look at its behavior on actual training data. This looks much more likely to be tractable. (Though this entire discussion is extremely subtle, and without being more precise it’s very easy for us to go wrong.)
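The "code path" observation suggests a simple diagnostic: measure which parts of the model are actually exercised on training data, and flag the parts that almost never are, since behavior gated on them was never shaped by training. A minimal sketch, with a random net and a planted dormant unit; the architecture and the 1% threshold are illustrative assumptions.

```python
import numpy as np

# Flag hidden units that almost never fire on the training distribution.
# Unit 0 is planted with a huge negative bias so it is a dormant "code
# path" that training data never exercises.

rng = np.random.default_rng(3)
W1 = rng.normal(size=(6, 16))
b1 = np.zeros(16)
b1[0] = -100.0          # planted dormant unit

def hidden(X):
    return np.maximum(0.0, X @ W1 + b1)

def rarely_active_units(X_train, threshold=0.01):
    """Indices of hidden units active on < threshold of training inputs."""
    rates = (hidden(X_train) > 0).mean(axis=0)
    return np.flatnonzero(rates < threshold)
```

A flagged unit isn’t automatically malign — it may just be dead — but it is exactly the kind of rarely-run conditional behavior that the argument above says a malign failure would have to hide in.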

Amplification: Transparency will be hard if we ultimately want to train models much smarter than we are, since deeply understanding what a model is thinking probably requires being much smarter than it. I think that problem can potentially be overcome by using amplification to stay ahead of the models we are training. This means that current work on transparency, which considers models doing tasks that are within the abilities of the humans who are trying to understand them, could potentially remain relevant even as AI improves significantly.

What do you do with transparency? Merely understanding that a model might behave catastrophically could be useful, but it would be much nicer to actually fix the problem. Adversarial training gives a natural mechanism: once we understand a failure we can synthesize appropriate data and then train on that data.
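This understand-synthesize-retrain loop can be sketched end to end if we cheat and use a transparent-by-construction model, so that "transparency" is literally reading the model. The lookup-table model, the bins, and the overseer below are all illustrative stand-ins.

```python
import numpy as np

# Transparency-driven adversarial training on a toy model: read every
# table entry, ask the (stand-in) overseer about it, synthesize inputs
# hitting the bad entries, and retrain on the overseer's labels.

BINS = np.linspace(-3.0, 3.0, 13)             # 12 bins over the input range

def overseer(x):
    return (x > 0).astype(int)                # stand-in ground-truth labels

def bin_index(x):
    return np.clip(np.digitize(x, BINS) - 1, 0, 11)

def fit_table(x, y):
    # Fixed arbitrary init stands in for "whatever the model happens to
    # do" on bins the training data never visits.
    table = np.random.default_rng(0).integers(0, 2, size=12)
    idx = bin_index(x)
    for b in range(12):
        if np.any(idx == b):
            table[b] = 1 if y[idx == b].mean() >= 0.5 else 0
    return table

def predict(table, x):
    return table[bin_index(x)]

# Train on a narrow slice of inputs only.
x_train = np.random.default_rng(4).uniform(-1.0, 1.0, size=200)
table = fit_table(x_train, overseer(x_train))

# "Transparency": inspect every entry and ask the overseer about it.
centers = (BINS[:-1] + BINS[1:]) / 2
bad_bins = np.flatnonzero(predict(table, centers) != overseer(centers))

# Synthesize data hitting the bad entries and retrain on it as well.
x_aug = np.concatenate([x_train, centers[bad_bins]])
table_fixed = fit_table(x_aug, overseer(x_aug))
```

With an opaque model the hard part is, of course, the "inspect every entry" step, which is exactly where the selection pressure discussed next becomes a worry.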

This approach puts significantly more stress on our transparency techniques. Even if we were initially able to use transparency to see how our model might fail, after we perform many generations of selection we might weed out exactly the comprehensible failures and leave the incomprehensible ones. You would only want to apply this technique if you had a great deal of faith in your methods; if you were feeling at all shaky about your ability to achieve worst-case guarantees, and transparency techniques let you see one potential catastrophic failure, it would be better to consider that a near-miss and seriously rework your project rather than plowing on.


Conclusion

Making ML systems work in the worst case is hard, even if we are only concerned with malign failures and have access to an overseer who can identify them. If we can’t solve this problem, I think it seriously calls into question the feasibility of aligned ML.

Fortunately there are at least a few plausible angles of attack on this problem. All of these approaches feel very difficult, but I don’t think we’ve run into convincing deal-breakers. I also think these approaches are complementary, which makes it feel even more plausible that they (or their descendants) will eventually be successful. I think that exploring these angles of attack, and identifying new approaches, should be a priority for researchers interested in alignment.

This was originally posted here on 1st February, 2018.

The next post in this sequence is “Reliability Amplification”, and will come out on Tuesday.