Goodhart’s Curse and Limitations on AI Alignment

I believe that most existing proposals for aligning AI with human values are unlikely to succeed in the limit of optimization pressure due to Goodhart’s curse. I believe this strongly enough that it continues to surprise me a bit that people keep working on things I think clearly won’t work, though I see two explanations for this. One is that, unlike me, they expect us to approach superhuman AGI slowly, so we will have many opportunities to notice when we are deviating from human values as a result of Goodhart’s curse and make corrections. The other is that they are simply unaware of the force of the argument that convinces me: although it has been written about before, I have not seen recent, pointed arguments for it, only technical explanations of it and its effects, and my own grokking of the point happened long ago on mailing lists of yore via more intuitive and less formal arguments than I see now. I can’t promise to make my points as intuitive as I would like, but I will nonetheless try to address this latter explanation by saying a few words about why I am convinced.

Note: Some of this borrows heavily from a paper I have out for publication, but with substantial additions for readability by a wider audience.

Goodhart’s Curse

Goodhart’s curse is what happens when Goodhart’s law meets the optimizer’s curse. Let’s review those two briefly for completeness. Feel free to skip some of this if you are already familiar.

Goodhart’s Law

As originally formulated, Goodhart’s law says "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes". A more accessible expression of Goodhart’s law is that when a measure of success becomes the target, it ceases to be a good measure. A well-known example comes from a program to exterminate rats in French-colonial Hanoi, Vietnam: the program paid a bounty for rat tails on the assumption that a rat tail represented a dead rat, but rat catchers would instead catch rats, cut off their tails, and release them so they could breed and produce new rats whose tails could be turned in for more bounties. Something similar happened with bounties for dead cobras in British-colonial India, which were intended to incentivize the reduction of cobra populations but instead resulted in the creation of cobra farms. And of course we can’t forget this classic, though apocryphal, tale:

In the old Soviet Union, the government rewarded factory managers for production quantity, not quality. In addition to ignoring quality, factory managers ignored customer needs. The end result was more and more tractors produced, for instance, even though these tractors just sat unused. Managers of factories that produced nails optimized production by producing either fewer, larger, heavier nails or more, smaller nails.
The fact that factories were judged by rough physical quotas rather than by their ability to satisfy customers – their customers were the state – had predictably bad results. If, to take a real case, a nail factory’s output was measured by number, factories produced large numbers of small pin-like nails. If output was measured by weight, the nail factories shifted to fewer, very heavy nails.

Although a joke, the natural endpoint might be the production of a single, giant nail. It’s unknown, to the best of my knowledge, whether the nail example above is real, although reportedly something similar really did happen with shoes. Additional examples of Goodhart’s law abound:

  • targeting easily-measured clicks rather than conversions in online advertising

  • optimizing for profits over company health in business

  • unintentionally incentivizing publication count over academic progress in academia

  • prioritizing grades and test scores over learning in schools

  • maximizing score rather than having fun in video games

As these examples demonstrate, most of us are familiar with Goodhart’s law, or something like it, from everyday life, so it’s not that surprising when we learn about it. The opposite seems to be true of the optimizer’s curse: it is well studied but mostly invisible to us in daily life unless we take care to notice it.

The optimizer’s curse

The optimizer’s curse observes that when choosing among several possibilities, if we choose the option that is expected to maximize value, we will be “disappointed” (realize less than the expected value) more often than average. This happens because optimization acts as a source of bias in favor of overestimation, even if the estimate of each option is not itself biased. And the curse is robust: even if an agent satisfices (accepts the option with the least expected value that is greater than neutral) rather than optimizes, it will still suffer more disappointment than gratification. So each option can be estimated in an unbiased way, yet because selection prefers options with high estimated value, we consistently pick options whose value is more likely to have been overestimated.
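To see the effect numerically, here is a minimal simulation sketch in Python (the parameters are arbitrary, made-up numbers): every option is truly worth nothing, our estimates of them are individually unbiased, and yet the option we pick consistently looks better than it turns out to be.

```python
import random

def average_disappointment(n_options=10, noise=1.0, trials=10_000):
    """Simulate the optimizer's curse: every option is truly worth 0 and the
    estimates are unbiased, yet the chosen option disappoints on average."""
    total = 0.0
    for _ in range(trials):
        # Unbiased, noisy estimates of options whose true value is exactly 0.
        estimates = [random.gauss(0.0, noise) for _ in range(n_options)]
        chosen_estimate = max(estimates)  # we pick the best-looking option
        true_value = 0.0                  # ...whose real value is still 0
        total += chosen_estimate - true_value
    return total / trials

random.seed(0)
# With 10 options this prints roughly 1.5: the winner looks about 1.5 noise
# standard deviations better than it really is, despite unbiased estimates.
print(average_disappointment())
```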

The optimizer’s curse has many opportunities to bite us. For example, a company trying to pick the project with the highest rate of return will consistently earn less than predicted due to the optimizer’s curse. The same goes for an investor picking investment instruments. Similarly, a person trying to pick the best vacation will, on average, have a worse vacation than expected, because the vacation that looks best is more likely than the other options to be worse than predicted. And of course an AI trying to pick the policy that maximizes human value will usually pick a policy that performs worse than expected, but we’ll return to that one later when we consider how it interacts with Goodhart.

I wish I had more, better examples of the optimizer’s curse to offer you, especially documented real-world cases that are relatable, but most of what I can find seems to be about petroleum production (no, really!) or otherwise about managing and choosing among capital-intensive projects. The best I can offer you is this story from my own life about shoes:

For a long time, I wanted to find the best shoes. “Best” could mean many things, but basically I wanted the best shoe for all purposes. I wanted the shoe to be technically impressive, so it would have features like waterproofing, puncture-proofing, extreme durability, extreme thinness and lightness, the ability to be worn without socks, and breathability. I also wanted it to look classy and casual, able to mix and match with anything. You might say this is impossible, but I would have said you just aren’t trying hard enough.
So I tried a lot of shoes. And in every case I was disappointed. One was durable but ugly, another was waterproof but made my feet smell, another looked good but was uncomfortable, and another was just too weird. The harder I tried to find the perfect shoe, the more I was disappointed. Cursed was I for optimizing!

This story isn’t perfect: I was optimizing for multiple variables and making tradeoffs, and the solution was to find the set of tradeoffs I would be happiest with and to accept that by trying new shoes I was mostly only going to move along the efficiency frontier rather than expand it. So it teaches the wrong lesson unless we look at it through a very narrow lens. Better examples in the comments are deeply appreciated!

Before moving on, it’s worth talking about attempts to mitigate the optimizer’s curse. Since it is a systematic bias, it would seem that we could account for it the way we do most systematic biases, using better Bayesian reasoning. And we can, but in many cases this is difficult or impossible because we lack sufficient information about the underlying distributions to make the necessary corrections. Instead we find ourselves in a situation where we know we suffer bias in our expectations but cannot adequately correct for it, so we can’t be sure we aren’t still suffering from it even when we try not to. Put another way, attempting to correct for the optimizer’s curse without perfect information simply shifts the distortions from the original estimates to the corrections without eliminating the bias.
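To make the difficulty concrete, here’s a toy sketch (the prior, noise levels, and shrinkage rule below are all assumptions chosen for illustration, not a recipe): if we shrink our estimates toward a prior using the right model of our own noise, the disappointment washes out; if our model of the noise is even modestly wrong, the bias simply reappears in the “corrected” numbers.

```python
import random

def trial(n=20, prior_sd=1.0, noise_sd=2.0, assumed_noise_sd=2.0):
    """One round: true values drawn from a known prior, unbiased noisy
    estimates, Bayesian shrinkage toward the prior mean, then choose the
    option with the best corrected estimate. Returns how much the corrected
    estimate of the chosen option overstates its true value."""
    true_values = [random.gauss(0.0, prior_sd) for _ in range(n)]
    estimates = [v + random.gauss(0.0, noise_sd) for v in true_values]
    k = prior_sd**2 / (prior_sd**2 + assumed_noise_sd**2)  # shrinkage factor
    corrected = [k * e for e in estimates]
    i = max(range(n), key=lambda j: corrected[j])
    return corrected[i] - true_values[i]

random.seed(0)
runs = 20_000
# Correct noise model: the disappointment averages out to roughly zero.
print(sum(trial() for _ in range(runs)) / runs)
# A noise model that understates how noisy our estimates are: the shrinkage
# is too weak and the chosen option is systematically overestimated again.
print(sum(trial(assumed_noise_sd=0.5) for _ in range(runs)) / runs)
```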

Given how persistent the optimizer’s curse is, it shouldn’t surprise us that it pops up when we try to optimize for some measurable target, giving us Goodhart’s curse.

Goodhart’s curse

Combining Goodhart’s law with the optimizer’s curse, we get Goodhart’s curse: attempts to optimize for a measure of success result in an increased likelihood of failing to hit the desired target. Or as someone on Arbital (probably Eliezer) put it: "neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V". In personal terms, you might say that the harder you try to get what you want, the more you’ll find yourself doing things that cause you not to get what you want, despite trying to act otherwise. I think this point is unintuitive because it feels contrary to the normal narrative that success comes from trying, and that trying harder makes success more likely, but that may only appear true due to survivorship bias. To give you an intuitive feel for this personal expression of Goodhart’s curse, another story from my life:

At some tender age, maybe around 11 or 12, I became obsessed with efficiency so I would have time to do more in my life.
There was the little stuff, like figuring out the “best” way to brush my teeth or get out of bed. There was the medium stuff, like finding ways to read faster or write without moving my hand as much. And there was the big stuff, like trying to figure out how to get by on less sleep and how to study topics in the optimal order. It touched everything, from tying shoes, to putting on clothes, to walking, to playing, to eating, and on and on. It was personal Taylorism gone mad.
To take a single example, let’s consider the important activity of eating breakfast cereal and how that process can be made more efficient. There’s the question of how to store the cereal, how to store the milk, how to retrieve the cereal and milk, how to pour the two into the bowl, how to hold the spoon, how to put the cereal in the mouth, how to chew, how to swallow, and how to clean up, to name just a few. Maybe I could save a few seconds if I held the spoon differently, or stored the cereal in a different container, or stored the milk on a different shelf in the refrigerator, or, or, or. By application of experimentation and observation I could get really good at eating cereal, saving maybe a minute or more off my daily routine!
Of course, this was out of a morning routine that lasted over an hour and included a lot of slack and waiting, because I had three sisters and two parents and lived in a house with two bathrooms. But still, one whole minute saved!
By the time I was 13 or 14 I was over it. I had spent a couple of years working hard at efficiency, gotten little for it, and lost a lot in exchange. Doing all that efficiency work was hard, made things that were once fun feel like work, and, worst of all, wasn’t delivering on the original purpose of doing more with my life. I had optimized for the measure (time to complete a task, number of motions to complete a task, etc.) at the expense of the target (getting more done). Yes, I was efficient at some things, but that efficiency was costing so much effort and willpower that I was worse off than if I had just ignored the kind of efficiency I was targeting.

In this story, as I did things that I thought would help me reach my target, I actually moved myself further away from it. Eventually it got bad enough that I noticed the divergence and was compelled to course correct, but this depended on me having ever known what the original target was. If I were not the optimizer, and instead, say, some impersonal apparatus like the state or an AI were, there’s considerable risk the optimizer would have kept optimizing and diverging long after it became clear to me that divergence had happened. For an intuitive sense of how this has happened historically, I recommend Seeing Like a State.
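To put the Arbital formulation above in numbers, here’s a toy simulation (the distributions are arbitrary choices of mine): candidates have a true value V, we can only score them by a proxy U equal to V plus noise, and we keep whichever candidate scores highest on U. The more candidates we search over, the larger the gap between the U we selected for and the V we wanted.

```python
import random

def proxy_gap(n_candidates, trials=3000):
    """Score candidates by a proxy U = V + noise, keep the highest-U one, and
    report how far U overshoots the true value V we actually cared about."""
    gap = 0.0
    for _ in range(trials):
        true_vals = [random.gauss(0.0, 1.0) for _ in range(n_candidates)]
        proxies = [v + random.gauss(0.0, 1.0) for v in true_vals]
        i = max(range(n_candidates), key=lambda j: proxies[j])
        gap += proxies[i] - true_vals[i]
    return gap / trials

random.seed(1)
# More optimization pressure (a larger pool to select from) produces a larger
# upward divergence of the proxy U from the target V for the selected option.
for n in (2, 10, 100, 1000):
    print(n, round(proxy_gap(n), 2))
```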

I hope by this point you are convinced of the power and prevalence of Goodhart’s curse (and if not, please let me know your thoughts in the comments, especially if you have ideas about what would be convincing). Now we are poised to consider Goodhart’s curse and its relationship to AI alignment.

Goodhart’s curse and AI alignment

Let’s suppose we want to build an AI that is aligned with human values. A high-level overview of a scheme for doing this is that we build an AI, check to see if it is aligned with human values so far, and then update it so that it is more aligned if it is not fully aligned already.

Although the details vary, this roughly describes the way methods like IRL and CIRL work, and possibly how HCH and safety via debate work in practice. Consequently, I think all of them will fail due to Goodhart’s curse.

Caveat: I think HCH and debate-like methods may be able to work and avoid Goodhart’s curse, though I’m not certain, and it would require careful design that I’m not sure the current work on these has done. I hope to have more to say on this in the future.

The way Goodhart’s curse sneaks into these is that they all apply optimization pressure to something observable that is not exactly the same as what we want. In the case of IRL and CIRL, an AI optimizes over inferred values rather than the values themselves. In HCH and safety via debate, a human preferentially selects an AI that the human observes and then comes to believe does what it wants. So long as that observation step is there and we optimize based on the observation, Goodhart’s curse applies, and we can expect, with sufficient optimization pressure, that alignment will be lost, even and perhaps especially without us noticing, because we’re focused on the observable measure rather than the target.
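One toy way to see how this plays out (my own illustration, with the added assumption that the errors in our inference or observation have heavier tails than the real variation in value): at modest optimization pressure the selected policy really is pretty good, so our checks pass, but as the search gets more powerful the winner is increasingly just the policy whose inferred value is most wrong, and the value we actually get falls even as the measured value climbs.

```python
import math
import random

def heavy_tailed_error(scale):
    # A standard Cauchy sample, scaled. This is an assumption for
    # illustration: inference errors with heavier tails than true values.
    return scale * math.tan(math.pi * (random.random() - 0.5))

def select_policy(n_candidates):
    """Each candidate policy has a true value drawn from N(0, 1); what the
    overseer sees is an inferred value = true value + heavy-tailed error.
    The policy with the highest inferred value gets selected."""
    true_vals = [random.gauss(0.0, 1.0) for _ in range(n_candidates)]
    inferred = [v + heavy_tailed_error(0.25) for v in true_vals]
    i = max(range(n_candidates), key=lambda j: inferred[j])
    return inferred[i], true_vals[i]

random.seed(3)
for n in (10, 100, 10_000):
    results = [select_policy(n) for _ in range(300)]
    avg_inferred = sum(r[0] for r in results) / len(results)
    avg_true = sum(r[1] for r in results) / len(results)
    # The measure we can observe keeps climbing as the search gets harder,
    # while the thing we actually wanted stalls and then falls toward zero.
    print(n, round(avg_inferred, 2), round(avg_true, 2))
```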

Yikes!

Beyond Goodhart’s curse

Do we have any hope of creating aligned AI if just making a (non-indifferent) choice based on an observation dooms us to Goodhart’s curse?

Honestly, I don’t know. I’m pretty pessimistic that we can solve alignment, yet in spite of this I keep working on it because I also believe it’s the best chance we have. I suspect we may only be able to rule out solutions that are dangerous, not positively select for solutions that are safe, and may have to approach alignment by eliminating everything that won’t work and then doing something in the tiny space of options left that we can’t say for sure will end in catastrophe.

Maybe we can get around Goodhart’s curse by applying so little optimization pressure that it doesn’t happen? One proposal in this direction is quantilization. I remain doubtful, since without sufficient optimization it’s not clear how we do better than picking at random.
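As a rough sketch of the idea (my own toy rendering, not the formal proposal; the numbers are arbitrary): rather than taking the single option that maximizes the proxy, sample randomly from, say, the top ten percent of options ranked by the proxy. You give up some measured score, but you also exploit the proxy’s errors less aggressively, so the divergence between proxy and target shrinks.

```python
import random

def quantilizer_choice(scored, q=0.1):
    """Pick uniformly at random from the top-q fraction of options as ranked
    by the proxy score, instead of taking the single proxy-maximizing one."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top = ranked[:max(1, int(len(ranked) * q))]
    return random.choice(top)

random.seed(2)
gap_argmax = gap_quant = 0.0
trials = 3000
for _ in range(trials):
    true_vals = [random.gauss(0.0, 1.0) for _ in range(200)]
    scored = [(v + random.gauss(0.0, 1.0), v) for v in true_vals]  # (proxy, true)
    u, v = max(scored)                 # full optimization pressure
    gap_argmax += u - v
    u, v = quantilizer_choice(scored)  # softened optimization pressure
    gap_quant += u - v
# The quantilizer gives up some proxy score but diverges less from the target.
print(gap_argmax / trials, gap_quant / trials)
```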

Maybe we can get around Goodhart’s curse by optimizing the target directly rather than a measure of it? Again, I remain doubtful, mostly due to epistemological issues suggesting that all we ever have are observations, never the real thing itself.

Maybe we can overcome either or both issues via pragmatic means that negate enough of the problem that, although we don’t actually eliminate Goodhart’s curse completely, we eliminate enough of its effect that we can ignore it? Given the risks and the downsides I’m not excited about this approach, but it may be the best we have.

And, if all that wasn’t bad enough, Goodhart’s curse isn’t even the only thing we have to watch out for! Scott Garrabrant and David Manheim have identified Goodhart’s curse as “Regressional Goodhart” to distinguish it from other forms of Goodharting, where mechanisms other than optimization may be responsible for divergence from the target. The only reason I focus on Goodhart’s curse is that it’s the way proposed alignment schemes usually fail; other safety proposals may fail via other Goodharting effects.

All this makes it seem extremely likely to me that we aren’t even close to solving AI alignment, to the point that we likely haven’t even stumbled upon the general mechanism that will work, or if we have, we haven’t identified it as such. Thus, if there’s anything upbeat I can end this on, it’s that there’s vast opportunity to do good for the world via work on AI safety.