Bounding Goodhart’s Law

Goodhart’s law seems to suggest that errors in utility or reward function specification are necessarily bad, in the sense that an optimal policy for the incorrect reward function would result in low return according to the true reward. But how strong is this effect?

Suppose the reward function were only slightly wrong. Can the resulting policy be arbitrarily bad according to the true reward, or is it only slightly worse? It turns out the answer is “only slightly worse” (for the appropriate definition of “slightly wrong”).


Consider a Markov Decision Process (MDP) $M = (S, A, T, R)$ where

  • $S$ is the set of states,
  • $A$ is the set of actions,
  • $T(s' \mid s, a)$ are the conditional transition probabilities, and
  • $R : S \times A \to \mathbb{R}$ is the reward function. (Note: “reward” is standard terminology for MDPs, but it’s fine to think of this as “utility”.)

A policy $\pi$ is a mapping from states to distributions over actions, with $\pi(a \mid s)$ the probability of taking action $a$ in state $s$.

Any given policy $\pi$ induces a distribution $\rho_\pi$ over states in this MDP. If we are concerned about average reward we can take $\rho_\pi$ to be the stationary distribution or, if the environment is episodic, we can take $\rho_\pi$ to be the distribution of states visited during the episode. The exact definition is not particularly important for us.

Define the return of policy $\pi$ according to reward function $R$ to be

$$G_R(\pi) = \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi(\cdot \mid s)}\left[R(s, a)\right].$$
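As a concrete illustration of the return, here is a minimal sketch for a tabular MDP that takes the state distribution to be the stationary distribution and approximates it by power iteration. The nested-list layouts `T[s][a][s2]`, `policy[s][a]`, and `R[s][a]`, the toy MDP, and the function names are all assumptions made for this example.

```python
def stationary_distribution(T, policy, iters=2000):
    """Power-iterate the state chain induced by `policy`.

    T[s][a][s2] is the transition probability and policy[s][a] is the
    probability of action a in state s (hypothetical toy layouts).
    """
    n = len(T)
    rho = [1.0 / n] * n
    for _ in range(iters):
        nxt = [0.0] * n
        for s in range(n):
            for a, p_a in enumerate(policy[s]):
                for s2, p_s2 in enumerate(T[s][a]):
                    nxt[s2] += rho[s] * p_a * p_s2
        rho = nxt
    return rho


def expected_return(T, R, policy):
    """Expected reward under the stationary state-action distribution."""
    rho = stationary_distribution(T, policy)
    return sum(
        rho[s] * p_a * R[s][a]
        for s in range(len(T))
        for a, p_a in enumerate(policy[s])
    )


# Toy 2-state, 2-action MDP: action 0 stays put with prob 0.9,
# action 1 switches state with prob 0.9.
T = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.1, 0.9], [0.9, 0.1]],
]
R = [[0.0, 0.0], [1.0, 1.0]]        # reward 1 whenever we are in state 1
stay = [[1.0, 0.0], [1.0, 0.0]]     # always take action 0
seek_1 = [[0.0, 1.0], [1.0, 0.0]]   # leave state 0, stay in state 1

g_stay = expected_return(T, R, stay)     # about 0.5
g_seek = expected_return(T, R, seek_1)   # about 0.9
```

The two policies face the same reward function but induce different state distributions, which is why their returns differ.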

Goodhart Regret

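Suppose we have an approximate reward signal $\hat{R}$ and we use it to specify a policy $\hat{\pi}$. How bad is $\hat{\pi}$ according to the true reward $R$?

More specifically, what is the regret of using $\hat{\pi}$ compared to the optimal policy $\pi^* = \arg\max_\pi G_R(\pi)$? Formally,

$$\mathrm{Regret}(\hat{\pi}) = G_R(\pi^*) - G_R(\hat{\pi}).$$

We can expand this as

$$\mathrm{Regret}(\hat{\pi}) = \left[G_R(\pi^*) - G_{\hat{R}}(\pi^*)\right] + \left[G_{\hat{R}}(\pi^*) - G_{\hat{R}}(\hat{\pi})\right] + \left[G_{\hat{R}}(\hat{\pi}) - G_R(\hat{\pi})\right].$$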

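Let $\epsilon > 0$. Then $\mathrm{Regret}(\hat{\pi}) < \epsilon$ if the following conditions are satisfied by $\hat{R}$ and $\hat{\pi}$:

1. $G_R(\pi^*) - G_{\hat{R}}(\pi^*) < \epsilon / 3$

2. $G_{\hat{R}}(\pi^*) - G_{\hat{R}}(\hat{\pi}) < \epsilon / 3$

3. $G_{\hat{R}}(\hat{\pi}) - G_R(\hat{\pi}) < \epsilon / 3$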

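Condition 2 says that $\hat{\pi}$ is not much worse than $\pi^*$ when measured against $\hat{R}$. That is what we expect if we designed $\hat{\pi}$ to be specifically good at $\hat{R}$, so condition 2 is just a formalization of the notion that $\hat{\pi}$ is tailored to $\hat{R}$.

Conditions 1 and 3 compare a fixed policy against two different reward functions. In general, for policy $\pi$ and reward functions $R_A$ and $R_B$,

$$G_{R_A}(\pi) - G_{R_B}(\pi) = \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi(\cdot \mid s)}\left[R_A(s, a) - R_B(s, a)\right].$$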
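To see why a difference of returns under one fixed policy reduces to a single expectation of the reward difference, note that the state distribution depends only on the policy and the transition probabilities, not on the reward. A short derivation for finite $S$ and $A$, writing $\rho_\pi$ for the state distribution induced by $\pi$ (notation assumed here):

$$
\begin{aligned}
G_{R_A}(\pi) - G_{R_B}(\pi)
&= \sum_{s} \rho_\pi(s) \sum_{a} \pi(a \mid s)\, R_A(s, a) - \sum_{s} \rho_\pi(s) \sum_{a} \pi(a \mid s)\, R_B(s, a) \\
&= \sum_{s} \rho_\pi(s) \sum_{a} \pi(a \mid s) \left[ R_A(s, a) - R_B(s, a) \right].
\end{aligned}
$$

The same step is not available when the two returns use different policies, since the two sums then run over different distributions. That is why condition 2 cannot be reduced to a pointwise comparison of rewards.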

Result: Uniformly Bounded Error

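Assume that we have a reward approximation $\hat{R}$ with uniformly bounded error. That is, $|\hat{R}(s, a) - R(s, a)| < \epsilon / 3$ for all $s \in S$, $a \in A$. Take $\hat{\pi} = \arg\max_\pi G_{\hat{R}}(\pi)$.

Then $\mathrm{Regret}(\hat{\pi}) < 2\epsilon / 3$. (Condition 2 has bound 0 in this case, while conditions 1 and 3 each hold with bound $\epsilon / 3$.)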
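The uniform-error result can be sanity-checked numerically. The sketch below writes `delta` (a name introduced here) for the uniform bound on the reward error, so the argument above predicts regret of at most `2 * delta`. The toy MDP, the restriction to deterministic policies, and the use of the stationary distribution are all assumptions of this example.

```python
import itertools
import random

# Hypothetical 2-state, 2-action MDP: T[s][a][s2] = P(s2 | s, a).
T = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.1, 0.9], [0.9, 0.1]],
]
R = [[0.5, 0.45], [0.5, 0.6]]  # true reward R[s][a]


def expected_return(reward, actions, iters=200):
    """Average reward of the deterministic policy `actions` (one action per state)."""
    n = len(T)
    rho = [1.0 / n] * n
    for _ in range(iters):  # power iteration towards the stationary distribution
        rho = [sum(rho[s] * T[s][actions[s]][s2] for s in range(n)) for s2 in range(n)]
    return sum(rho[s] * reward[s][actions[s]] for s in range(n))


policies = list(itertools.product(range(2), repeat=2))
best_true = max(expected_return(R, p) for p in policies)

rng = random.Random(0)
delta = 0.1  # uniform bound on |R_hat - R|
worst_regret = 0.0
for _ in range(100):
    # Perturb every entry of R by at most delta.
    R_hat = [[r + rng.uniform(-delta, delta) for r in row] for row in R]
    # pi_hat is optimal for R_hat, so condition 2 holds with bound 0.
    pi_hat = max(policies, key=lambda p: expected_return(R_hat, p))
    regret = best_true - expected_return(R, pi_hat)
    worst_regret = max(worst_regret, regret)
```

Across the sampled perturbations, `worst_regret` should never exceed `2 * delta`, however the errors happen to be distributed.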

Result: One-sided Error Bounds

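A uniform bound on the error is a stronger condition than we really need. The conditions on $\hat{R}$ can be re-written:

1. $\mathbb{E}_{s \sim \rho_{\pi^*},\, a \sim \pi^*(\cdot \mid s)}\left[R(s, a) - \hat{R}(s, a)\right] < \epsilon / 3$; $\hat{R}$ does not substantially underestimate the reward in the regions of state-space that are frequently visited by $\pi^*$.

3. $\mathbb{E}_{s \sim \rho_{\hat{\pi}},\, a \sim \hat{\pi}(\cdot \mid s)}\left[\hat{R}(s, a) - R(s, a)\right] < \epsilon / 3$; $\hat{R}$ does not substantially overestimate the reward in the regions of state-space that are frequently visited by $\hat{\pi}$.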

In other words, it doesn’t matter if the reward estimate is too low for states that $\pi^*$ doesn’t want to visit anyway. This tells us that, in the absence of more information, we should prefer biasing our reward approximation to be low. We do need to be careful about not overestimating where $\hat{\pi}$ does visit, which is made difficult by the fact that condition 2 pushes $\hat{\pi}$ to visit states to which $\hat{R}$ assigns high reward.
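The asymmetry between underestimating and overestimating is easiest to see in a one-state MDP, where a policy is just a choice of action. In this minimal sketch the reward values and the helper name `greedy_regret` are made up for illustration.

```python
def greedy_regret(R, R_hat):
    """Regret under R of the action that is greedy for R_hat (one-state MDP)."""
    a_hat = max(range(len(R_hat)), key=lambda a: R_hat[a])
    return max(R) - R[a_hat]


R = [1.0, 0.5, 0.0]             # true reward per action; action 0 is optimal
R_hat_low = [1.0, -5.0, -9.0]   # badly underestimates actions the optimal policy avoids
R_hat_high = [1.0, 0.5, 2.0]    # overestimates one action the optimal policy avoids

regret_low = greedy_regret(R, R_hat_low)    # 0.0
regret_high = greedy_regret(R, R_hat_high)  # 1.0
```

Large downward errors on avoided actions cost nothing here, while a single upward error is enough to pull the greedy policy onto the worst action.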

If we don’t know what $\pi^*$ is then it might be difficult to ensure we don’t underestimate the reward over $\rho_{\pi^*}$. We probably have access to $\hat{\pi}$, so we might expect to have better reward approximations over $\rho_{\hat{\pi}}$ and therefore have an easier time satisfying condition 3 than condition 1.

The bound on $\mathrm{Regret}(\hat{\pi})$ does not actually require $\pi^*$ to be an optimal policy for $R$. We can take $\pi^*$ to be any policy, and if we can satisfy the 3 bounds then $\hat{\pi}$ will be not much worse than $\pi^*$ (and could be better). Using a known policy for $\pi^*$ likely makes it easier to satisfy condition 1, which is an expectation over the state-action distribution of $\pi^*$.

Example: Human Reward Learning

Since $\pi^*$ need not be optimal, we can take $\pi^*$ to be the policy of a human and try to learn our reward / utility function. Note: this is a very flawed proposal; its purpose is to demonstrate how one might think about using these bounds in reward learning, not to be a concrete example of safe reward learning.

Suppose that there is some ideal reward function $R$ that we’d like to approximate. We don’t know $R$ in general, but imagine we can evaluate it for state-action pairs that are performed by a human. Let $D$ be a collection of human demonstrations with labelled reward.

Consider the following algorithm:

1. Fit $\hat{R}$ to $D$ so that it does not underestimate.

2. Set $\hat{\pi} = \arg\max_\pi G_{\hat{R}}(\pi)$.

3. Collect a batch $\hat{D}$ of demonstrations from $\hat{\pi}$ (no reward labels).

4. Augment $D$ with some additional human demonstrations.

5. Modify $\hat{R}$ to assign lower reward to all of $\hat{D}$ while not underestimating on $D$.

6. Repeat from 2.
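The loop above can be sketched in a deliberately tiny setting: a one-state MDP in which each action stands in for a state-action pair. Everything concrete here is an assumption made for illustration and not part of the proposal: the reward table, the optimistic initial value `R_MAX`, the fixed decrement `STEP`, and the fixed schedule of human demonstrations.

```python
TRUE_R = [0.0, 0.6, 0.8, 0.1, 0.3]  # hypothetical ideal reward, one entry per action
HUMAN_DEMOS = [1, 2]                # actions the (hypothetical) human is seen taking
R_MAX = 1.0                         # assumed upper bound on reward
STEP = 0.25                         # how far the estimate is lowered on policy rollouts


def greedy(r_hat):
    """Step 2: in a one-state MDP the optimal policy for r_hat is its argmax action."""
    return max(range(len(r_hat)), key=lambda a: r_hat[a])


def run(rounds=20):
    labelled = {HUMAN_DEMOS[0]: TRUE_R[HUMAN_DEMOS[0]]}  # D: action -> reward label
    # Step 1: match the labels on D, stay optimistic (R_MAX) everywhere else.
    r_hat = [labelled.get(a, R_MAX) for a in range(len(TRUE_R))]
    for t in range(1, rounds + 1):
        rollouts = [greedy(r_hat)]                  # steps 2-3: policy rollouts, no labels
        demo = HUMAN_DEMOS[t % len(HUMAN_DEMOS)]    # step 4: one more human demonstration
        labelled[demo] = TRUE_R[demo]
        for a in rollouts:                          # step 5: lower the estimate on rollouts ...
            r_hat[a] -= STEP
        for a, label in labelled.items():           # ... without underestimating on D
            r_hat[a] = max(r_hat[a], label)
    return greedy(r_hat), r_hat, labelled


pi_hat, r_hat, labelled = run()
# Assumes the human picks uniformly among the demonstrated actions.
human_return = sum(TRUE_R[a] for a in HUMAN_DEMOS) / len(HUMAN_DEMOS)
```

In this toy run the policy settles on the best human-demonstrated action, so it does at least as well as the human. The estimate stays too high on some actions that were never labelled, but the policy no longer visits them.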

Assuming $\hat{R}$ is sufficiently expressive, this algorithm pushes $\hat{R}$ and $\hat{\pi}$ towards satisfying all three conditions: $\hat{R}$ does not underestimate on the distribution of human-visited states, $\hat{R}$ is otherwise as low as possible on the states that $\hat{\pi}$ visits, and $\hat{\pi}$ is optimal for $\hat{R}$. If these are all met, the resulting policy would be no worse than the human at maximizing $R$ and possibly much better.

Comments and Takeaways

Goodhart’s law is more impactful in the sparse reward setting, in which approximation means deciding what to value, not how much to value. Consider a state space that is the real line and suppose that the true reward function $R$ is defined in terms of the indicator function $\mathbb{1}$.

If our estimate $\hat{R}$ places the reward on the wrong states, then Goodhart’s law applies and we can expect that a policy optimized for $\hat{R}$ will have high regret on $R$.

On the other hand, if our estimate $\hat{R}$ rewards the right states but by the wrong amount, then the regret bounds apply and a policy optimized for $\hat{R}$ will do very well on $R$.
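A small sketch can make the two cases concrete. The specific reward functions below are hypothetical stand-ins chosen for this example, not the ones from the original discussion, and the "policy" is simplified to picking the single state on a grid that maximizes the estimate.

```python
def indicator(condition):
    return 1.0 if condition else 0.0


def true_reward(s):          # hypothetical R: reward only inside (-1, 1)
    return indicator(abs(s) < 1)


def misplaced_estimate(s):   # hypothetical estimate that values the wrong region
    return indicator(abs(s - 10) < 1)


def rescaled_estimate(s):    # hypothetical estimate: right region, wrong amount
    return 0.5 * indicator(abs(s) < 1)


STATES = [x / 2 for x in range(-40, 41)]  # grid over [-20, 20]


def regret_of_maximizer(estimate):
    """True-reward regret of the state that maximizes `estimate` on the grid."""
    s_hat = max(STATES, key=estimate)
    return max(true_reward(s) for s in STATES) - true_reward(s_hat)


regret_misplaced = regret_of_maximizer(misplaced_estimate)  # 1.0
regret_rescaled = regret_of_maximizer(rescaled_estimate)    # 0.0
```

Under these assumptions the misplaced estimate incurs the maximum possible regret, while the rescaled one incurs none.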

  • Assigning high value to the wrong things is bad.

  • Assigning the wrong value to the right things is not too bad.

  • Throwing a large amount of optimization power against an incorrect utility function is not always bad.