The reverse Goodhart problem

There are two aspects to the Goodhart problem which are often conflated. One is trivially true for all proxy-true utility pairs; but the other is not.

Following this terminology, we'll say that $V$ is the true goal, and $U$ is the proxy. In the range of circumstances we're used to, $U \approx V$; that's what makes $U$ a good proxy. Then the Goodhart problem has two aspects to it:

  1. Maximising $U$ does not increase $V$ as much as maximising $V$ would.

  2. When strongly maximising $U$, $V$ starts to increase at a slower rate, and ultimately starts decreasing.

Aspect 1. is a tautology: the best way to maximise $V$ is to… maximise $V$. Hence maximising $U$ is almost certainly less effective at increasing $V$ than maximising $V$ directly.
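To spell out why aspect 1. holds for any pair, here is a one-line sketch. The feasible set $X$ and the maximisers $x_U$, $x_V$ are notation assumed for illustration, not taken from the post:

```latex
x_U \in \arg\max_{x \in X} U(x), \qquad x_V \in \arg\max_{x \in X} V(x)
\quad\Longrightarrow\quad
V(x_U) \;\le\; \max_{x \in X} V(x) \;=\; V(x_V).
```

Equality holds only if a maximiser of $U$ happens to maximise $V$ as well, which is why this aspect says nothing specific about any particular proxy.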

But aspect 2. is not a tautology, and need not be true for generic proxy-true utility pairs $(U, V)$. For instance, some pairs have the reverse Goodhart problem:

  1. When strongly maximising $U$, $V$ starts to increase at a faster rate, and ultimately starts increasing more than twice as fast as $U$.

Are there utility functions that have anti-Goodhart problems? Yes, many. If $(U, V)$ has a Goodhart problem, then $(U, V')$ has an anti-Goodhart problem, where $V' = 2U - V$.

Then in the range of circumstances we're used to, $V' = 2U - V \approx 2U - U = U \approx V$. And, as $V$ starts growing slower than $U$, $V'$ starts growing faster; when $V$ starts decreasing, $V'$ starts growing more than twice as fast as $U$.
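The mirroring construction can be checked numerically. The sketch below is a toy illustration, not something from the post. It assumes a scalar "optimisation pressure" `x`, a linear proxy $U(x) = x$, and a made-up true utility $V(x) = x - x^3/3$ that tracks $U$ for small `x` and then turns over. All function names are hypothetical.

```python
def proxy(x):
    """Proxy U: grows linearly with optimisation pressure x (toy assumption)."""
    return x


def goodhart_true(x):
    """Toy true utility V: close to U for small x, then flattens and falls."""
    return x - x**3 / 3


def reverse_true(x):
    """Mirrored utility V' = 2U - V, as constructed in the text."""
    return 2 * proxy(x) - goodhart_true(x)


def slope(f, x, h=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


if __name__ == "__main__":
    # Compare growth rates of U, V and V' as optimisation pressure rises.
    for x in (0.1, 0.5, 1.0, 1.5):
        print(
            f"x={x:.1f}  dU={slope(proxy, x):.2f}  "
            f"dV={slope(goodhart_true, x):+.2f}  "
            f"dV'={slope(reverse_true, x):+.2f}"
        )
```

With these toy curves, $dV/dx = 1 - x^2$ and $dV'/dx = 1 + x^2$, so $V$ starts falling at exactly the point ($x > 1$) where $V'$ starts growing more than twice as fast as $U$.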

Are there more natural utility functions that have anti-Goodhart problems? Yes. For instance, this happens if you're a total or average utilitarian and you maximise the proxy “do the best for the worst off”. In general, if $V$ is your true utility and $U$ is a prioritarian/conservative version of $V$ (eg $U = -e^{-V}$ or $U = \log(V)$ or other concave, increasing functions) then we have reverse Goodhart behaviour[1].
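As a worked instance of the concave case, take $V$ as the true utility and suppose the proxy is a logarithm of it, shifted so the two agree near $V = 1$. The shift by $1$ and the choice of $V = 1$ as the "typical" point are assumptions made for this illustration only:

```latex
\begin{aligned}
U &= 1 + \ln V, \qquad V > 0, \\
U &= V - \tfrac{1}{2}(V - 1)^2 + O\!\big((V-1)^3\big)
  \quad\Rightarrow\quad U \approx V \text{ near } V = 1, \\
V &= e^{\,U - 1}
  \quad\Rightarrow\quad \frac{dV}{dU} = e^{\,U - 1} = V .
\end{aligned}
```

Under these assumptions, each extra unit of $U$ buys more than one unit of $V$ once $V > 1$, and more than two units once $V > 2$ (that is, once $U > 1 + \ln 2$). Pushing hard on the conservative proxy makes the true utility accelerate rather than stall.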

So saying that we expect Goodhart problems (in the second sense) means that we know something special about $V$ (and $U$). It's not a generic problem for all utility functions, but for the ones we expect to correspond to human preferences.


  1. We also need to scale the proxy $U$ so that $V \approx U$ on the typical range of circumstances; thus the conservatism of $U$ is only visible away from the typical range. ↩︎