Goodhart’s Law states that when a proxy for some value becomes the target of optimization pressure, it ceases to be a good proxy. One form of Goodhart is demonstrated by the Soviet story of a factory graded on how many shoes it produced (ordinarily a good proxy for productivity) – it soon began churning out huge numbers of tiny shoes. Useless, but the numbers look good.
Goodhart’s Law is of particular relevance to AI alignment. Suppose you have something that is generally a good proxy for “the stuff that humans care about”. It would still be dangerous to have a powerful AI optimize for that proxy: in accordance with Goodhart’s Law, the proxy will break down under the optimization pressure.
In Goodhart Taxonomy, Scott Garrabrant identifies four kinds of Goodharting:
Regressional Goodhart—When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
Causal Goodhart—When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.
Extremal Goodhart—Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.
Adversarial Goodhart—When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.
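Regressional Goodhart, the mildest of the four, can be seen in a small simulation. The sketch below (illustrative only; the variable names and parameters are my own, not from Garrabrant’s taxonomy) models the proxy as the true goal plus independent noise: selecting the items with the highest proxy values also selects for high noise, so the selected items’ true value predictably falls short of their proxy value.

```python
import random

random.seed(0)

# Model: proxy = goal + independent noise.
# Selecting hard on the proxy selects for both the goal AND the noise,
# so the winners' true goal value regresses toward the mean.
n = 100_000
goals = [random.gauss(0, 1) for _ in range(n)]
proxies = [g + random.gauss(0, 1) for g in goals]

# Select the top 1% of items by proxy value.
top = sorted(range(n), key=lambda i: proxies[i], reverse=True)[: n // 100]

mean_proxy = sum(proxies[i] for i in top) / len(top)
mean_goal = sum(goals[i] for i in top) / len(top)

print(f"mean proxy of selected items: {mean_proxy:.2f}")
print(f"mean goal  of selected items: {mean_goal:.2f}")
```

Running this, the selected items score well on the true goal (selection still helps), but systematically worse than their proxy scores suggest: the gap is exactly the “difference between the proxy and the goal” that the selection pressure also optimized for.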