Bounding Goodhart's Law

Goodhart's law seems to suggest that errors in utility or reward function specification are necessarily bad, in the sense that an optimal policy for the incorrect reward function would result in low return according to the true reward. But how strong is this effect?

Suppose the reward function were only slightly wrong. Can the resulting policy be arbitrarily bad according to the true reward, or is it only slightly worse? It turns out the answer is "only slightly worse" (for the appropriate definition of "slightly wrong").


Consider a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, R)$ where

  • $\mathcal{S}$ is the set of states,
  • $\mathcal{A}$ is the set of actions,
  • $P(s' \mid s, a)$ are the conditional transition probabilities, and
  • $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function. (Note: "reward" is standard terminology for MDPs but it's fine to think of this as "utility".)

A policy $\pi$ is a mapping from states to distributions over actions, with $\pi(a \mid s)$ denoting the probability of taking action $a$ in state $s$.

Any given policy $\pi$ induces a distribution $d_\pi$ over states in this MDP. If we are concerned about average reward we can take $d_\pi$ to be the stationary distribution or, if the environment is episodic, we can take $d_\pi$ to be the distribution of states visited during the episode. The exact definition is not particularly important for us.

Define the return of policy $\pi$ according to reward function $R$ to be

$$J_R(\pi) = \mathbb{E}_{s \sim d_\pi,\, a \sim \pi(\cdot \mid s)}\left[ R(s, a) \right].$$
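As a sanity check, here is a minimal sketch of this definition for a toy two-state MDP, computing $d_\pi$ as the stationary distribution and then the return $J_R(\pi)$ as the expected reward under it. All of the transition probabilities, rewards, and the policy are made-up illustrative numbers.

```python
import numpy as np

n_states, n_actions = 2, 2
# P[s, a, s'] = probability of moving to s' after taking action a in state s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.5, 0.5], [0.1, 0.9]],   # transitions from state 1
])
R = np.array([[1.0, 0.0],       # R[s, a]
              [0.0, 2.0]])
# A stochastic policy: pi[s, a] = pi(a | s).
pi = np.array([[0.5, 0.5],
               [0.25, 0.75]])

def stationary_distribution(P, pi):
    """The state distribution d_pi: fixed point of the policy's Markov chain."""
    # Markov chain over states induced by following pi: T[s, s'].
    T = np.einsum("sa,sat->st", pi, P)
    # Left eigenvector of T with eigenvalue 1, normalized to sum to 1.
    evals, evecs = np.linalg.eig(T.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

def J(pi, R, P):
    """Return J_R(pi) = E_{s ~ d_pi, a ~ pi(.|s)}[R(s, a)]."""
    d = stationary_distribution(P, pi)
    return float(np.einsum("s,sa,sa->", d, pi, R))
```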

Goodhart Regret

Suppose we have an approximate reward signal $\tilde R$ and we use it to specify a policy $\tilde\pi$. How bad is $\tilde\pi$ according to the true reward $R$?

More specifically, what is the regret of using $\tilde\pi$ compared to the optimal policy $\pi^* = \operatorname{argmax}_\pi J_R(\pi)$? Formally,

$$\operatorname{Regret}(\tilde\pi) = J_R(\pi^*) - J_R(\tilde\pi).$$

We can expand this as

$$\operatorname{Regret}(\tilde\pi) = \big[J_R(\pi^*) - J_{\tilde R}(\pi^*)\big] + \big[J_{\tilde R}(\pi^*) - J_{\tilde R}(\tilde\pi)\big] + \big[J_{\tilde R}(\tilde\pi) - J_R(\tilde\pi)\big].$$

Let $\epsilon = \epsilon_1 + \epsilon_2 + \epsilon_3$. Then $\operatorname{Regret}(\tilde\pi) \le \epsilon$ if the following conditions are satisfied by $\tilde R$ and $\tilde\pi$:

1. $J_R(\pi^*) - J_{\tilde R}(\pi^*) \le \epsilon_1$

2. $J_{\tilde R}(\pi^*) - J_{\tilde R}(\tilde\pi) \le \epsilon_2$

3. $J_{\tilde R}(\tilde\pi) - J_R(\tilde\pi) \le \epsilon_3$
Condition 2 says that $\tilde\pi$ is not much worse than $\pi^*$ when measured against $\tilde R$. That is what we expect if we designed $\tilde\pi$ to be specifically good at $\tilde R$, so condition 2 is just a formalization of the notion that $\tilde\pi$ is tailored to $\tilde R$.

Conditions 1 and 3 compare a fixed policy against two different reward functions. In general, for a policy $\pi$ and reward functions $R_1$ and $R_2$,

$$J_{R_1}(\pi) - J_{R_2}(\pi) = \mathbb{E}_{s \sim d_\pi,\, a \sim \pi(\cdot \mid s)}\big[R_1(s, a) - R_2(s, a)\big].$$

Result: Uniformly Bounded Error

Assume that we have a reward approximation $\tilde R$ with uniformly bounded error. That is, $|R(s, a) - \tilde R(s, a)| \le \epsilon / 2$ for all $s$ and $a$. Take $\tilde\pi = \operatorname{argmax}_\pi J_{\tilde R}(\pi)$.

Then $\operatorname{Regret}(\tilde\pi) \le \epsilon$: conditions 1 and 3 each hold with bound $\epsilon / 2$ by the identity above, and condition 2 has bound $0$ in this case because $\tilde\pi$ maximizes $J_{\tilde R}$.
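A quick numerical check of this result on a small random MDP, with deterministic policies enumerated by brute force. The MDP, the noise, and $\epsilon = 0.2$ are all arbitrary choices for illustration.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# Hypothetical random MDP; every transition probability is strictly positive,
# so each deterministic policy induces a unique stationary distribution.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # R[s, a]

def J(policy, reward):
    """Average reward of a deterministic policy (a tuple: action per state)."""
    T = np.array([P[s, policy[s]] for s in range(n_states)])  # chain T[s, s']
    evals, evecs = np.linalg.eig(T.T)
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])     # stationary dist.
    d = d / d.sum()
    return float(sum(d[s] * reward[s, policy[s]] for s in range(n_states)))

def best(reward):
    """Brute-force argmax of J over all deterministic policies."""
    return max(itertools.product(range(n_actions), repeat=n_states),
               key=lambda p: J(p, reward))

eps = 0.2
# Approximation with uniformly bounded error: |R - R_tilde| <= eps / 2.
R_tilde = R + rng.uniform(-eps / 2, eps / 2, size=R.shape)

pi_star, pi_tilde = best(R), best(R_tilde)
regret = J(pi_star, R) - J(pi_tilde, R)
assert -1e-9 <= regret <= eps + 1e-9  # the bound from the result above
```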

Result: One-sided Error Bounds

A uniform bound on the error is a stronger condition than we really need. The conditions on $\tilde R$ can be re-written:

1. $\mathbb{E}_{s \sim d_{\pi^*},\, a \sim \pi^*(\cdot \mid s)}\big[R(s, a) - \tilde R(s, a)\big] \le \epsilon_1$: $\tilde R$ does not substantially underestimate the reward in the regions of state-space that are frequently visited by $\pi^*$.

3. $\mathbb{E}_{s \sim d_{\tilde\pi},\, a \sim \tilde\pi(\cdot \mid s)}\big[\tilde R(s, a) - R(s, a)\big] \le \epsilon_3$: $\tilde R$ does not substantially overestimate the reward in the regions of state-space that are frequently visited by $\tilde\pi$.

In other words, it doesn't matter if the reward estimate is too low for states that $\tilde\pi$ doesn't want to visit anyway. This tells us that, in the absence of more information, we should prefer biasing our reward approximation to be low. We do need to be careful about not overestimating where $\tilde\pi$ does visit, which is made difficult by the fact that condition 2 pushes $\tilde\pi$ to visit states to which $\tilde R$ assigns high reward.
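The asymmetry is easiest to see in a one-state MDP, i.e. a bandit, where the return reduces to the reward of the chosen action. The reward numbers below are made up.

```python
import numpy as np

# One-state MDP: a bandit. J reduces to the reward of the chosen action,
# and d_pi plays no role. All reward values below are illustrative.
R_true = np.array([1.0, 0.5, 0.2])

# Biased low on actions we should avoid anyway -- harmless:
R_low = np.array([1.0, 0.0, 0.0])
# Overestimates a single wrong action -- exactly what condition 3 forbids:
R_high = np.array([1.0, 1.5, 0.2])

def regret(R_approx):
    """True regret from greedily optimizing the approximation."""
    return float(R_true.max() - R_true[np.argmax(R_approx)])

assert regret(R_low) == 0.0   # underestimating elsewhere costs nothing
assert regret(R_high) == 0.5  # a single overestimate creates real regret
```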

If we don't know what $\pi^*$ is, then it might be difficult to ensure we don't underestimate the reward over $d_{\pi^*}$. We probably have access to $\tilde\pi$, so we might expect to have better reward approximations over $d_{\tilde\pi}$ and therefore have an easier time satisfying condition 3 than condition 1.


The bound on $\operatorname{Regret}(\tilde\pi)$ does not actually require $\pi^*$ to be an optimal policy for $R$. We can take $\pi^*$ to be any policy, and if we can satisfy the three bounds then $\tilde\pi$ will be not much worse than $\pi^*$ (and could be better). Using a known policy for $\pi^*$ likely makes it easier to satisfy condition 1, which is an expectation over the state-action distribution of $\pi^*$.

Example: Human Reward Learning

Since $\pi^*$ need not be optimal, we can take $\pi^*$ to be the policy of a human and try to learn our reward / utility function. Note: this is a very flawed proposal; its purpose is to demonstrate how one might think about using these bounds in reward learning, not to be a concrete example of safe reward learning.

Suppose that there is some ideal reward function $R$ that we'd like to approximate. We don't know $R$ in general, but imagine we can evaluate it for state-action pairs that are performed by a human. Let $D$ be a collection of human demonstrations with labelled reward.

Consider the following algorithm:

1. Fit $\tilde R$ to $D$ so that it does not underestimate.

2. Set $\tilde\pi = \operatorname{argmax}_\pi J_{\tilde R}(\pi)$.

3. Collect a batch of demonstrations from $\tilde\pi$ (no reward labels).

4. Augment $D$ with some additional human demonstrations.

5. Modify $\tilde R$ to assign lower reward to all of the demonstrations collected in step 3, while not underestimating on $D$.

6. Repeat from 2.

Assuming $\tilde R$ is sufficiently expressive, this algorithm pushes $\tilde R$ and $\tilde\pi$ towards satisfying all three conditions: $\tilde R$ does not underestimate $R$ on the distribution of human-visited states, is otherwise as low as possible on states $\tilde\pi$ visits, and $\tilde\pi$ is optimal for $\tilde R$. If these are all met, the resulting policy would be no worse than the human at maximizing $R$, and possibly much better.
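Here is a deliberately tiny sketch of the loop above, with "states" collapsed to five bandit arms, a human who always demonstrates arm 1, and a tabular $\tilde R$. The reward values and the 0.3 decrement are assumptions for illustration, not part of the proposal.

```python
import numpy as np

# Toy instantiation: the true reward is only ever queried on human arms.
R_true = np.array([0.2, 0.8, 0.5, 0.1, 0.9])  # hypothetical true rewards
human_arm = 1
D = {human_arm: R_true[human_arm]}            # labelled human demonstrations

R_tilde = np.ones(5)                    # step 1: fit without underestimating on D
for _ in range(20):
    pi_tilde = int(np.argmax(R_tilde))  # step 2: optimize the approximation
    visited = [pi_tilde]                # step 3: demonstrations from pi_tilde
    # (step 4 omitted: the human keeps demonstrating the same arm)
    for s in visited:                   # step 5: push the estimate down,
        floor = D.get(s, -np.inf)       # but never below a labelled reward
        R_tilde[s] = max(floor, R_tilde[s] - 0.3)

# The loop settles on the human's arm: no worse than the demonstrator.
assert int(np.argmax(R_tilde)) == human_arm
```

In this toy run the result exactly matches the human (return 0.8) rather than beating it, because the human never demonstrates the better arm 4; "possibly much better" requires demonstrations or labels that reveal higher-reward regions.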

Comments and Takeaways

Goodhart's law is more impactful in the context of the sparse reward setting, in which approximation means deciding what to value, not how much to value. Consider a state space that is the real line and suppose that the true reward function is $R(s) = \mathbb{1}[0 \le s \le 1]$, where $\mathbb{1}$ is the indicator function.

If we estimate

$$\tilde R(s) = \mathbb{1}[2 \le s \le 3],$$

then Goodhart's law applies and we can expect that a policy optimized for $\tilde R$ will have high regret on $R$.

On the other hand, if we estimate

$$\tilde R(s) = \frac{1}{2} \cdot \mathbb{1}[0 \le s \le 1],$$

then the regret bounds apply and a policy optimized for $\tilde R$ will do very well on $R$.
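To make the contrast concrete, here is a sketch with assumed rewards: a target interval $[0, 1]$, a misplaced estimate on $[2, 3]$, and a half-scale estimate on the correct interval. "Policy optimization" is collapsed to picking the single best state from a grid.

```python
import numpy as np

# Assumed concrete rewards, matching the indicator examples above.
def R(s):            # true sparse reward
    return float(0.0 <= s <= 1.0)

def R_wrong(s):      # values the wrong states entirely
    return float(2.0 <= s <= 3.0)

def R_scaled(s):     # right states, wrong magnitude
    return 0.5 * float(0.0 <= s <= 1.0)

states = np.linspace(-5.0, 5.0, 1001)

def optimize(reward):
    """Collapse 'policy optimization' to choosing the best single state."""
    return max(states, key=reward)

best = max(R(s) for s in states)            # achievable true reward
assert best - R(optimize(R_wrong)) == 1.0   # Goodhart: maximal regret
assert best - R(optimize(R_scaled)) == 0.0  # scaled estimate: no regret
```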

  • Assigning high value to the wrong things is bad.

  • Assigning the wrong value to the right things is not too bad.

  • Throwing a large amount of optimization power against an incorrect utility function is not always bad.