The Three Levels of Goodhart's Curse

Note: I now consider this post deprecated and instead recommend this updated version.

Goodhart’s curse is a neologism by Eliezer Yudkowsky stating that “neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V.” It is related to many nearby concepts (e.g. the tails come apart, winner’s curse, optimizer’s curse, regression to the mean, overfitting, edge instantiation, Goodhart’s law). I claim that there are three main mechanisms through which Goodhart’s curse operates.

Goodhart’s Curse Level 1 (regressing to the mean): We are trying to optimize the value of $V$, but since we cannot observe $V$, we instead optimize a proxy $U$, which is an unbiased estimate of $V$. When we select for points with a high $U$ value, we will be biased towards points for which $U$ is an overestimate of $V$.

As a simple example, imagine $V$ and $E$ (for error) are independently normally distributed with mean 0 and variance 1, and $U = V + E$. If we sample many points and take the one with the largest $U$ value, we can predict that $E$ will likely be positive for this point, and thus the $U$ value will predictably be an overestimate of the $V$ value.
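The example above is easy to check numerically. The following is a minimal simulation sketch, not part of the original argument: the function names, the number of points (1000), the number of trials (200), and the seed are all arbitrary choices made for illustration.

```python
import random


def select_max_u(num_points, rng):
    """Sample independent V, E ~ N(0, 1), form U = V + E,
    and return the (U, V, E) triple with the largest U."""
    best = None
    for _ in range(num_points):
        v = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        u = v + e
        if best is None or u > best[0]:
            best = (u, v, e)
    return best


def average_selected(num_points=1000, trials=200, seed=0):
    """Repeat the selection many times and average U, V and E
    over the selected points (all parameters are illustrative)."""
    rng = random.Random(seed)
    picks = [select_max_u(num_points, rng) for _ in range(trials)]
    return tuple(sum(p[i] for p in picks) / trials for i in range(3))


avg_u, avg_v, avg_e = average_selected()
print(f"selected point: mean U = {avg_u:.2f}, "
      f"mean V = {avg_v:.2f}, mean E = {avg_e:.2f}")
```

In a run of this sketch, the selected points should show a clearly positive average error, so their average $U$ sits above their average $V$, even though $U$ is an unbiased estimate of $V$ for a point chosen without looking at $U$.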

In many cases (like the one above), the best you can do without observing $V$ is still to take the largest $U$ value you can find, but you should still expect that this $U$ value overestimates $V$.
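For the normal example above, both halves of this claim can be read off from a short calculation. This is a worked sketch that uses only the stated assumptions ($V$ and $E$ independent standard normals, $U = V + E$) and the standard formula for the conditional mean of jointly normal variables:

$$\mathrm{Cov}(U, V) = \mathrm{Var}(V) = 1, \qquad \mathrm{Var}(U) = \mathrm{Var}(V) + \mathrm{Var}(E) = 2, \qquad \mathbb{E}[V \mid U = u] = \frac{\mathrm{Cov}(U, V)}{\mathrm{Var}(U)}\, u = \frac{u}{2}.$$

The expected $V$ is increasing in $u$, so the point with the largest $U$ is still the best pick. But for any $u > 0$ the expected $V$ is only half of $u$, so the observed $U$ overestimates $V$ by $u/2$ on average.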

Similarly, if $U$ is not necessarily an unbiased estimator of $V$, but $U$ and $V$ are correlated, and you sample a million points and take the one with the highest $U$ value, you will end up with a $V$ value on average strictly less than if you could just take a point with a one-in-a-million $V$ value directly.

Goodhart’s Curse Level 2 (optimizing away the correlation): Here, we assume $U$ and $V$ are correlated on average, but there may be different regions in which this correlation is stronger or weaker. When we optimize $U$ to be very high, we zoom in on the region of very large $U$ values. This region could in principle have very small $V$ values.

As a very simple example, imagine $U$ is integer uniform between 0 and 1000 inclusive, and $V$ is equal to $U$ mod 1000. Overall, $U$ and $V$ are correlated. The point where $U$ is 1000 and $V$ is 0 is an outlier, but it is only one point and does not sway the correlation that much. However, when we apply a lot of optimization pressure, we throw away all the points with low $U$ values, and are left with a small number of extreme points. Since this is a small number of points, the correlation between $U$ and $V$ says little about what value $V$ will take.
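This example is small enough to enumerate exactly. The sketch below treats each integer from 0 to 1000 as one equally likely value of $U$, computes the Pearson correlation over the whole range, and then looks at the points with the largest $U$. The helper `pearson` and the variable names are introduced here for illustration only.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Every possible U value once, i.e. the uniform distribution on 0..1000.
us = list(range(1001))
vs = [u % 1000 for u in us]

overall_corr = pearson(us, vs)

# Maximal optimization pressure: keep only the point with the largest U.
best_u = max(us)
best_v = best_u % 1000

# The three points with the largest U, as (U, V) pairs.
top = sorted(zip(us, vs), reverse=True)[:3]

print(f"correlation over all points: {overall_corr:.3f}")
print(f"point with the largest U: U = {best_u}, V = {best_v}")
print(f"top three points by U: {top}")
```

The correlation over the full range is very close to 1, yet the single point that maximizes $U$ is exactly the outlier with $V = 0$, while the runners-up have nearly the largest possible $V$.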

Another, more realistic example is that $U$ and $V$ are two correlated dimensions in a multivariate normal distribution, but we cut off the normal distribution to only include the disk of points in which $U^{2} + V^{2} < n$ for some large $n$. This example represents a correlation between $U$ and $V$ in naturally occurring points, but also a boundary around what types of points are feasible, which need not respect this correlation.

Imagine you were to sample $k$ points in the above example and take the one with the largest $U$ value. As you increase $k$, at first, this optimization pressure lets you find better and better points for both $U$ and $V$, but as you increase $k$ to infinity, eventually you sample so many points that you will find a point near $U = \sqrt{n}, V = 0$. Once enough optimization pressure is applied, the correlation between $U$ and $V$ stops mattering, and instead the boundary of what kinds of points are possible at all decides what kind of point is selected.
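The rise and fall of $V$ as $k$ grows can be sketched with rejection sampling. Everything numeric here is an assumption made for illustration: unit variances, a correlation of 0.8, a deliberately small $n = 4$ (so the boundary is reached after a modest number of samples rather than requiring an astronomically large $k$), 30 trials per setting, and a fixed seed. The function names are likewise hypothetical.

```python
import math
import random


def sample_truncated(rng, rho, n):
    """Draw one (U, V) from a bivariate normal with unit variances and
    correlation rho, rejecting points outside the disk U^2 + V^2 < n."""
    while True:
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        u = z1
        v = rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        if u * u + v * v < n:
            return u, v


def best_of_k(k, rng, rho=0.8, n=4.0):
    """Sample k feasible points and return the one with the largest U."""
    return max((sample_truncated(rng, rho, n) for _ in range(k)),
               key=lambda point: point[0])


def mean_selected(k, trials=30, seed=0, rho=0.8, n=4.0):
    """Average U and V of the selected point over several trials."""
    rng = random.Random(seed)
    picks = [best_of_k(k, rng, rho, n) for _ in range(trials)]
    mean_u = sum(p[0] for p in picks) / trials
    mean_v = sum(p[1] for p in picks) / trials
    return mean_u, mean_v


# Increasing optimization pressure: larger k means a more extreme U.
results = {k: mean_selected(k) for k in (1, 10, 100, 20000)}
for k, (mean_u, mean_v) in results.items():
    print(f"k = {k:>5}: mean U = {mean_u:.2f}, mean V = {mean_v:.2f}")
```

With these settings, moderate $k$ should select points where $V$ is well above its unconditional mean of 0, while the largest $k$ pushes $U$ toward $\sqrt{n} = 2$ and forces $V$ back toward 0, since any point with $U$ close to $\sqrt{n}$ must satisfy $V^{2} < n - U^{2}$.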

Goodhart’s Curse Level 3 (adversarial correlations): Here we are selecting a world with a high $U$ value because we want a world with a high $V$ value, and we believe $U$ to be a good proxy for $V$. However, there is another agent who wants to optimize some other value $W$. Assume that $W$ and $V$ are contradictory: points with a high $W$ value necessarily have a low $V$ value, since they demand similar resources.

Since you are using $U$ as a proxy, this other agent is incentivized to make $U$ and $W$ correlated as much as it can. It wants to cause your process which selects a large $U$ value to also select a large $W$ value (and thus a small $V$ value).

Making $U$ and $W$ correlated may be difficult, but thanks to Level 2 of Goodhart’s Curse, the adversary need only make them correlated at the extreme values of $U$.

For example, suppose you run a company and have a programmer employee whom you want to create a working product ($V$). You incentivize the employee by selecting for or rewarding employees that produce a large number of lines of code ($U$). The employee wants you to pay him to slack off all day ($W$). $W$ and $V$ are contradictory. The employee is incentivized to make worlds with high $U$ also have high $W$, and thus have low $V$. Thus, the employee may adversarially write a script to generate a bunch of random lines of code that do nothing, giving himself more time to slack off.

Level 3 is the thing most behind the original Goodhart’s Law (although Level 2 contributes as well).

Level 3 is also the mechanism behind a superintelligent AI making a Treacherous Turn. Here, $V$ is doing what the humans want forever, $U$ is doing what the humans want in the training cases where the AI does not have enough power to take over, and $W$ is whatever the AI wants to do with the universe.

Finally, Level 3 is also behind the malignancy of the universal prior, where you want to predict well forever ($V$), so hypotheses might predict well for a while ($U$), so that they can manipulate the world with their future predictions ($W$).
