This post provides a mathematical analysis of a toy model of Goodhart’s Law. Namely, it assumes that the optimization proxy U is a sum of the true utility function V and noise X, such that:
V and X are independent random variables w.r.t. some implicit distribution ζ on the solution space. The meaning of this distribution is not discussed, but I guess we might think of it as some kind of inductive bias, e.g. a simplicity prior.
The optimization process can be modeled as conditioning ζ on a high value of U=V+X.
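To spell out what I take this conditioning to mean (my reading; the post may phrase it differently): fix a threshold t and consider

```latex
\zeta_t(\cdot) = \zeta(\,\cdot \mid U \ge t\,), \qquad U = V + X,
\qquad \text{and track} \quad
\mathbb{E}_{\zeta_t}[V] = \mathbb{E}_{\zeta}[\,V \mid V + X \ge t\,]
\quad \text{as } t \to \infty .
```

Goodhart occurring then corresponds to this conditional expectation failing to grow as the proxy is optimized harder.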
In this model, the authors prove that Goodhart occurs when X is subexponential and its tail is sufficiently heavier than that of V. Conversely, when X is sufficiently light-tailed, Goodhart doesn’t occur.
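To see the dichotomy numerically, here is a rough Monte Carlo sketch (mine, not from the post), with a standard normal V and either Pareto (subexponential) or normal noise standing in for X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000_000

# True utility V: light-tailed (standard normal) in both scenarios.
V = rng.standard_normal(n)

# Noise X: subexponential/heavy-tailed (Pareto) vs. light-tailed (normal).
noises = {
    "heavy-tailed X (Pareto)": rng.pareto(2.5, n),
    "light-tailed X (normal)": rng.standard_normal(n),
}

for name, X in noises.items():
    U = V + X  # proxy = true utility + noise
    for t in (2.0, 3.5, 5.0):
        sel = U >= t  # condition on a high value of the proxy
        print(f"{name}: E[V | U >= {t}] ~= {V[sel].mean():.2f} "
              f"({sel.sum()} samples)")
```

In the heavy-tailed case the conditional mean of V stays small and does not grow with t (the tail events are driven by X), while in the light-tailed case it keeps climbing with t, which is the qualitative behavior the stated result describes.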
My opinion:
On the one hand, kudos for using actual math to study an alignment-relevant problem.
On the other hand, the modeling assumptions feel too toyish for most applications. Specifically, the idea that V and X are independent random variables seems implausible. Typically, we worry about Goodhart’s law because the proxy behaves differently in different domains. In the “ordinary” domain that motivated the choice of proxy, U is a good approximation of V. However, in other domains U might be unrelated to V or even anticorrelated.
For example, ordinarily, smiles on human-looking faces are an indication of happy humans. However, in worlds that contain many more inanimate facsimiles of humans than actual humans, there is no such correlation.
Or, to take the example used in the post: ordinarily, if a sufficiently smart expert human judge reads an AI alignment proposal, they form an accurate opinion of how good the proposal is. But if the proposal contains superhumanly clever manipulation and psychological warfare, this ordinary relationship breaks down completely. I don’t expect this effect to behave like independent random noise at all.
Less importantly, it might be interesting to extend this analysis to a more realistic model of optimization. For example, the optimizer learns a function F which is the best approximation to U out of some hypothesis class H, and then optimizes F instead of the actual U. (Incidentally, this might generate an additional Goodhart effect due to the discrepancy between F and U.) Alternatively, the optimizer learns an infrafunction Φ that is a coarsening of U out of some hypothesis class H and then optimizes Φ.
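A minimal sketch of the first variant (the toy solution space, the stand-in V, the noise model, and the choice of H as cubic polynomials are all my own illustration, not the post's): learn F as the best fit to U within H, optimize F, and compare the true utility achieved against optimizing U directly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy solution space: 200 points in [0, 1]; V is a stand-in true utility,
# U = V + noise is the proxy, and H is the class of cubic polynomials.
xs = np.linspace(0.0, 1.0, 200)
V = np.sin(3 * np.pi * xs)                    # stand-in true utility
U = V + 0.5 * rng.standard_normal(xs.shape)   # proxy = V + noise

# Learn F: the best approximation to U within the hypothesis class H.
F = np.polyval(np.polyfit(xs, U, deg=3), xs)

# Optimize F instead of U, and compare the true utility obtained.
print("V at argmax F:", round(V[np.argmax(F)], 2))
print("V at argmax U:", round(V[np.argmax(U)], 2))
print("max of V:     ", round(V.max(), 2))
```

The gap between the first two lines isolates the extra Goodhart effect coming from the F-versus-U discrepancy, on top of the usual U-versus-V one.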