Constructing Goodhart

A recent question from Scott Garrabrant brought up the issue of formalizing Goodhart’s Law. The problem is to come up with some model system where optimizing for something which is almost-but-not-quite the thing you really want produces worse results than not optimizing at all. Considering how endemic Goodhart’s Law is in the real world, this is surprisingly non-trivial.

Let’s start simple: we have some true objective V(x), and we want to choose x to maximize it. Sadly, we don’t actually have any way to determine the true value V(x) for a given value x — but we can determine a proxy U(x) = V(x) + ε(x), where ε is some random function of x. People talked about this following Scott’s question, so I won’t math it out here, but the main answer is that more optimization of the proxy U still improves the true objective V on average over a wide variety of assumptions. John Maxwell put it nicely in his answer to Scott’s question:

If your proxy consists of something you’re trying to maximize plus unrelated noise that’s roughly constant in magnitude, you’re still best off maximizing the heck out of that proxy, because the very highest value of the proxy will tend to be a point where the noise is high and the thing you’re trying to maximize is also high.

In short: absent some much more substantive assumptions, there is no Goodhart effect.
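To make that concrete, here’s a quick simulation (a toy setup of my own, not anything from Scott’s thread): the proxy is the true objective plus independent noise of roughly constant magnitude, and “more optimization” just means searching over more options and keeping the one with the highest proxy value. The average true value of the chosen option keeps climbing as the search grows.

```python
import random

random.seed(0)

def trial(n_search: int) -> float:
    """Search n_search random options, pick the one with the highest proxy
    value (true value + unrelated noise), and return its *true* value."""
    true_vals = [random.gauss(0, 1) for _ in range(n_search)]
    proxies = [v + random.gauss(0, 1) for v in true_vals]  # proxy = true + noise
    best = max(range(n_search), key=lambda i: proxies[i])
    return true_vals[best]

# Harder optimization of the proxy (a bigger search) still helps on average:
for n_search in [1, 10, 100, 1000]:
    avg = sum(trial(n_search) for _ in range(2000)) / 2000
    print(f"search over {n_search:>4} options -> average true value {avg:.2f}")
```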

Rather than generic random functions, I suggest thinking about Goodhart on a causal DAG instead. As an example, I’ll use the old story about Soviet nail factories evaluated on number of nails made, and producing huge numbers of tiny useless nails.

We really want to optimize something like the total economic value of nails produced. There’s some complicated causal network leading from the factory’s inputs to the economic value of its outputs (we’ll use a dramatically simplified network as an example).

If we pick a specific cross-section of that network, we find that economic value is mediated by number of nails, size, and strength — those variables are enough to determine the objective. All the inputs further up influence the objective by changing number, size, and/or strength of nails.

Now, we choose number of nails as a proxy for the objective. If we were just using this proxy to optimize machine count, that would be fine — machine count only influences our objective via number of nails produced, it doesn’t affect size or strength, so number of nails is a fine proxy for our true objective for the purpose of ordering machines. But mould shape is another matter. Mould shape affects both number and size, so we can use a smaller mould to increase number of nails while decreasing size. If we’re using number as a proxy for the true objective, ignoring size and strength, then that’s going to cause a problem.

Generalizing: we have a complicated causal DAG which determines some output we really want to optimize. We notice that some node in the middle of that DAG is highly predictive of happy outputs, so we optimize for that thing as a proxy. If our proxy were a bottleneck in the DAG — i.e. it’s on every possible path from inputs to output — then that would work just fine. But in practice, there are other nodes in parallel to the proxy which also matter for the output — in our example, size and strength. By optimizing for the proxy, we accept trade-offs which harm the nodes in parallel to it, which potentially adds up to a net-harmful effect on the output.
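Here’s a code sketch of that picture. The specific functional forms are made-up assumptions purely for illustration: more machines means proportionally more nails, a smaller mould means more but smaller nails, and each nail is worth the most near some ideal size. What matters is that the structure matches the DAG: machine count reaches the objective only through number of nails, while mould size reaches it through both number and size.

```python
def number_of_nails(machine_count: float, mould_size: float) -> float:
    # More machines -> more nails; a smaller mould -> more (smaller) nails
    # out of the same material.
    return 100.0 * machine_count / mould_size

def nail_size(mould_size: float) -> float:
    return mould_size

def nail_strength() -> float:
    return 1.0  # held fixed in this toy model

def economic_value(number: float, size: float, strength: float) -> float:
    # Each nail is worth the most near an ideal size of 1.0, and is worthless
    # if it's far too small or far too large.
    value_per_nail = max(0.0, 1.0 - (size - 1.0) ** 2 / 0.25) * strength
    return number * value_per_nail

def true_objective(machine_count: float, mould_size: float) -> float:
    n = number_of_nails(machine_count, mould_size)
    return economic_value(n, nail_size(mould_size), nail_strength())

# Optimizing the proxy (number of nails) via machine count is safe:
print(true_objective(machine_count=1.0, mould_size=1.0))  # baseline
print(true_objective(machine_count=2.0, mould_size=1.0))  # more machines, more value

# Optimizing the same proxy via the mould trades off against size:
print(number_of_nails(1.0, 0.55), true_objective(1.0, 0.55))  # lots of nails, little value
```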

So we have a model which can potentially give rise to Goodhart, but will it? If we construct a random DAG, choose a proxy node close to the objective, and optimize for that proxy, we probably won’t see a Goodhart effect (at least not right away). Why not? Well, if we’ve just initialized all the parameters randomly, then whatever change we make to optimize for number of nails is just as likely to improve other sub-objectives as to harm them. For instance, if we’re starting off with a random mould, then it’s just as likely to be too big as too small — if it’s producing giant useless nails, then shrinking the mould improves both number and size of nails.
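Here’s that intuition in the same toy model (same made-up formulas as above): draw a random mould size, take one proxy-optimizing step by shrinking the mould slightly, and check whether the true objective went up. From random starting points the step helps more often than not, since a random mould is quite likely to be too big to begin with.

```python
import random

random.seed(0)

def true_objective(machine_count: float, mould_size: float) -> float:
    # Same toy model as above, collapsed into one function.
    number = 100.0 * machine_count / mould_size
    value_per_nail = max(0.0, 1.0 - (mould_size - 1.0) ** 2 / 0.25)
    return number * value_per_nail

trials, improved = 10_000, 0
for _ in range(trials):
    s = random.uniform(0.6, 1.4)             # randomly initialized mould
    before = true_objective(1.0, s)
    after = true_objective(1.0, s - 0.05)    # proxy-optimizing step: shrink the mould
    improved += after > before

print(f"the proxy step helped the true objective in {improved / trials:.0%} of random starts")
```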

Of course, in the real world, we probably wouldn’t be starting from a giant useless mould. Goodhart hits in the real world because we’re not just starting from random points, we’re starting from points which have had some optimization already. But we’re not starting from the best possible point either — if we were, any change would be bad, proxy optimization or not. Rather, I expect that most real systems are starting from a pareto-optimal point.

Here’s why: look at the cross-section of our causal DAG from earlier. Number, size, strength… in the business world, we’d call these key performance indicators (KPIs) for the factory. If something obviously improves one or more KPIs without any downside, then usually everyone immediately agrees that it should be done. That’s the generalized efficient markets hypothesis, on super-easy mode. Without trade-offs, optimization is trivial. Add trade-offs, and things get contentious: there’s a trade-off between number and size, so the quality assurance department gets into an argument with the sales department about how to handle the trade-off, and some agreement is hammered out which probably isn’t all that optimal.

If we’ve made all the optimizations we can without getting into trade-offs, then we’re at a pareto-optimal point: we cannot improve any KPI without harming some other KPI. If we expect those optimizations to be easy and to happen all the time, then we should expect to usually end up at pareto optima.

And if we’re already at a pareto optimum, and we start optimizing for some proxy objective, then we’re definitely going to harm at least some of the other objectives. That’s the whole point of pareto optimality, after all: we can’t improve one thing without trading off against something else. That doesn’t mean that we’ll see net harm to the true objective right away; even if we’re pareto optimal, we could be starting from a point with far too few nails produced. If the factory has a culture of unnecessary perfectionism, then pushing for higher nail count may help. But keep pushing, and we’ll slide down the pareto curve past the optimal point and into unhappy territory. That’s the mark of a Goodhart effect.
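Here’s what that slide looks like in the toy model: hold machine count fixed, start from a slightly perfectionist mould, and keep shrinking it to push nail count up.

```python
def true_objective(machine_count: float, mould_size: float) -> float:
    # Same toy model as before.
    number = 100.0 * machine_count / mould_size
    value_per_nail = max(0.0, 1.0 - (mould_size - 1.0) ** 2 / 0.25)
    return number * value_per_nail

mould_size = 1.1  # starting point: slightly-too-large, "perfectionist" nails
for _ in range(10):
    number = 100.0 / mould_size
    value = true_objective(1.0, mould_size)
    print(f"mould {mould_size:.2f}: {number:6.1f} nails, true value {value:6.1f}")
    mould_size -= 0.06  # keep pushing the nail-count proxy
```

The proxy looks better at every step; the true objective only looks better for the first few.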