Bounding Goodhart’s Law

Goodhart’s law seems to suggest that errors in utility or reward function specification are necessarily bad in sense that an optimal policy for the incorrect reward function would result in low return according to the true reward. But how strong is this effect?

Suppose the reward function were only slightly wrong. Can the resulting policy be arbitrarily bad according to the true reward or is it only slightly worse? It turns out the answer is “only slightly worse” (for the appropriate definition of “slightly wrong”).


Consider a Markov Decision Process (MDP) where

  • is the set of states,
  • is the set of actions,
  • are the conditional transition probabilities, and
  • is the reward function. (Note: “reward” is standard terminology for MDPs but it’s fine to think of this as “utility”)

A policy is a mapping from states to distributions over actions with.

Any given policy induces a distribution over states in this MDP. If we are concerned about average reward we can take to be the stationary distribution or, if the environment is episodic, we can take to be the distribution of states visited during the episode. The exact definition is not particularly important for us.

Define the return of policy according to reward function to be

Goodhart Regret

Suppose we have an approximate reward signal and we use it to specify a policy . How bad is according to the true reward ?

More specifically, what is the regret of using compared to the optimal policy ? Formally,

We can expand this as

Let then if the following conditions are satisfied by and :




Condition 2 says that is not much worse than when measured against . That is what we expect if we designed to be specifically good at , so condition 2 is just a formalization of the notion that is tailored to .

Conditions 1 and 3 compare a fixed policy against two different reward functions. In general for policy and reward functions and ,

Result: Uniformly Bounded Error

Assume that we have a reward approximation with uniformly bounded error. That is, . Take .

Then . (Condition 2 has bound 0 in this case).

Result: One-sided Error Bounds

A uniform bound on the error is a stronger condition than we really need. The conditions on can be re-written:

1. ; does not substantially underestimate the reward in the regions of state-space that are frequently visited by .

3. ; does not substantially overestimate the reward in the regions of state-space that are frequently visited by .

In other words, it doesn’t matter if the reward estimate is too low for states that doesn’t want to visit anyways. This tells us that we should prefer biasing our reward approximation to be low in the absence of more information. We do need to be careful about not overestimating where does visit, which is made difficult by the fact that condition 2 pushes to visit states that assigns high reward.

If we don’t know what is then it might be difficult to ensure we don’t underestimate the reward over . We probably have access to so we might expect to have better reward approximations over and therefore have an easier time satisfying condition 3 than 1.


The bound on does not actually require to be an optimal policy for . We can take to be any policy and if we can satisfy the 3 bounds then will be not much worse than (and could be better). Using a known policy for likely makes it easier to satisfy condition 1, which is an expectation over the state-action distribution of .

Example: Human Reward Learning

Since need not be optimal, we can take to be the policy of a human and try to learn our reward /​ utility function. Note: this is a very flawed proposal, its purpose is to demonstrate how one might think about using these bounds in reward learning, not to be a concrete example of safe reward learning.

Suppose that there is some ideal reward function that we’d like to approximate. We don’t know in general but imagine we can evaluate it for state-action pairs that are performed by a human. Let be a collection of human demonstrations with labelled reward.

Consider the following algorithm:

1. Fit to so that it does not underestimate.

2. Set .

3. Collect a batch of demonstrations from (no reward labels).

4. Augment with some additional human demonstrations.

5. Modify to assign lower reward to all of while not underestimating on

6. Repeat from 2.

Assuming is sufficiently expressive, this algorithm pushes and towards satisfying all three conditions: does not underestimate on the distribution of human-visited states, is otherwise as low as possible on states visits, and is optimal for . If these are all met, the resulting policy would be no worse than the human at maximizing and possibly much better.

Comments and Takeaways

Goodhart’s law is more impactful in the context of the sparse reward setting, in which approximation means deciding what to value, not how much to value. Consider a state space that is the real line and suppose that the true reward function is where is the indicator function.

If we estimate

then Goodhart’s law applies and we can expect that a policy optimized for will have high regret on .

On the other hand, if we estimate

then the regret bounds apply and a policy optimized for will do very well on .

  • Assigning high value to the wrong things is bad.

  • Assigning the wrong value to the right things is not too bad.

  • Throwing a large amount of optimization power against an incorrect utility function is not always bad.