Constructing Goodhart

A recent question from Scott Garrabrant brought up the issue of formalizing Goodhart’s Law. The problem is to come up with some model system where optimizing for something which is almost-but-not-quite the thing you really want produces worse results than not optimizing at all. Considering how endemic Goodhart’s Law is in the real world, this is surprisingly non-trivial.

Let’s start simple: we have some true objective $u(x)$, and we want to choose $x$ to maximize it. Sadly, we don’t actually have any way to determine the true value $u(x)$ for a given value $x$, but we can determine a proxy $v(x) = u(x) + \epsilon(x)$, where $\epsilon$ is some random function of $x$. People talked about this following Scott’s question, so I won’t math it out here, but the main answer is that more optimization of $v$ still improves $u$ on average, over a wide variety of assumptions. John Maxwell put it nicely in his answer to Scott’s question:

If your proxy consists of something you’re trying to maximize plus unrelated noise that’s roughly constant in magnitude, you’re still best off maximizing the heck out of that proxy, because the very highest value of the proxy will tend to be a point where the noise is high and the thing you’re trying to maximize is also high.

In short: absent some much more substantive assumptions, there is no Goodhart effect.
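To see this concretely, here’s a minimal simulation sketch (my own toy setup, not from the original discussion; the uniform candidate distribution, the linear true objective, and the Gaussian noise are all assumptions): score many candidate $x$’s with a proxy equal to the true objective plus constant-magnitude noise, then check whether picking the best-scoring candidate beats picking at random.

```python
# Sketch: maximizing (true objective + constant-magnitude noise) vs. picking at random.
import random

random.seed(0)

def true_objective(x):
    return x  # assumed for illustration: true value just increases with x

def proxy(x):
    return true_objective(x) + random.gauss(0, 10)  # unrelated noise, roughly constant magnitude

candidates = [random.uniform(0, 100) for _ in range(10_000)]

best_by_proxy = max(candidates, key=proxy)  # "maximize the heck out of the proxy"
random_pick = random.choice(candidates)

print("true value of proxy-optimal pick:", true_objective(best_by_proxy))
print("true value of random pick:       ", true_objective(random_pick))
```

With these assumptions the proxy-optimal pick reliably lands near the top of the true objective, which is Maxwell’s point: additive noise alone doesn’t produce a Goodhart effect.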

Rather than generic random functions, I suggest thinking about Goodhart on a causal DAG. As an example, I’ll use the old story about Soviet nail factories which were evaluated on the number of nails made, and responded by producing huge numbers of tiny, useless nails.

We really want to optimize something like the total economic value of nails produced. There’s some complicated causal network leading from the factory’s inputs to the economic value of its outputs (we’ll use a dramatically simplified network as an example).

If we pick a specific cross-section of that network, we find that economic value is mediated by number of nails, size, and strength; those variables are enough to determine the objective. All the inputs further up influence the objective by changing the number, size, and/or strength of the nails.

Now, we choose number of nails as a proxy for the objective. If we were just using this proxy to optimize machine count, that would be fine: machine count only influences our objective via the number of nails produced; it doesn’t affect size or strength, so number of nails is a fine proxy for our true objective for the purpose of ordering machines. But mould shape is another matter. Mould shape affects both number and size, so we can use a smaller mould to increase the number of nails while decreasing their size. If we’re using number as a proxy for the true objective, ignoring size and strength, then that’s going to cause a problem.

Generalizing: we have a complicated causal DAG which determines some output we really want to optimize. We notice that some node in the middle of that DAG is highly predictive of happy outputs, so we optimize for that node as a proxy. If our proxy were a bottleneck in the DAG (i.e., it lies on every possible path from inputs to output), then that would work just fine. But in practice, there are other nodes in parallel to the proxy which also matter for the output; in our example, size and strength. By optimizing for the proxy, we accept trade-offs which harm the nodes in parallel to it, which can add up to a net-harmful effect on the output.
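Here’s a toy version of that DAG as code (a sketch with made-up numbers; the particular functional forms for nail count, usefulness, and economic value are my assumptions, not part of the nail-factory story). Machine count only feeds into the number of nails, while mould size feeds into both number and size, so pushing the nail-count proxy through the mould trades off against the parallel nodes.

```python
# Toy nail-factory DAG: inputs (machine_count, mould_size) -> (number, size, strength) -> value.

def number_of_nails(machine_count, mould_size):
    return machine_count * 1000.0 / mould_size   # smaller mould => more nails per batch

def nail_size(mould_size):
    return mould_size

def nail_strength():
    return 1.0  # held fixed in this sketch

def economic_value(machine_count, mould_size):
    n = number_of_nails(machine_count, mould_size)
    usefulness = min(nail_size(mould_size) / 10.0, 1.0) ** 2 * nail_strength()  # tiny nails are nearly useless
    return n * usefulness

print("value with  5 machines, 10mm mould:", economic_value(5, 10.0))
print("value with 10 machines, 10mm mould:", economic_value(10, 10.0))  # more machines: proxy and value both rise
print("value with  5 machines,  1mm mould:", economic_value(5, 1.0))    # far more nails...
print("nails with  5 machines,  1mm mould:", number_of_nails(5, 1.0))   # ...but far less value
```

With these numbers, adding machines raises both the proxy and the true objective, while shrinking the mould from 10mm to 1mm multiplies the proxy tenfold and divides the true objective by ten.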

So we have a model which can potentially give rise to Goodhart, but will it? If we construct a random DAG, choose a proxy node close to the objective, and optimize for that proxy, we probably won’t see a Goodhart effect (at least not right away). Why not? Well, if we’ve just initialized all the parameters randomly, then whatever change we make to optimize for number of nails is just as likely to improve other sub-objectives as to harm them. For instance, if we’re starting off with a random mould, then it’s just as likely to be too big as too small — if it’s producing giant useless nails, then shrinking the mould improves both number and size of nails.
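A quick way to check that claim in the same toy model (again, all the numbers are assumed for illustration): start from a random mould size, take one proxy-optimizing step by shrinking the mould to boost nail count, and count how often that step helps versus hurts the true objective.

```python
# Proxy-optimizing step from random starting points in the toy factory.
import random

random.seed(0)

def value(mould_size, machines=5):
    n = machines * 1000.0 / mould_size
    usefulness = min(mould_size / 10.0, 1.0) ** 2
    return n * usefulness

trials = 10_000
improved = 0
for _ in range(trials):
    m = random.uniform(1.0, 30.0)    # random initial mould size
    if value(m * 0.8) > value(m):    # shrink the mould 20% to boost nail count
        improved += 1

print("fraction of random starts where the proxy step helped:", improved / trials)
```

From random starting points the proxy step helps a substantial fraction of the time; guaranteed harm only shows up once the easy, trade-off-free wins are gone.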

Of course, in the real world, we probably wouldn’t be starting from a giant useless mould. Goodhart hits in the real world because we’re not starting from random points; we’re starting from points which have had some optimization already. But we’re not starting from the best possible point either; if we were, any change would be bad, proxy optimization or not. Rather, I expect that most real systems are starting from a Pareto-optimal point.

Here’s why: look at the cross-section of our causal DAG from earlier. Number, size, strength… in the business world, we’d call these key performance indicators (KPIs) for the factory. If something obviously improves one or more KPIs without any downside, then usually everyone immediately agrees that it should be done. That’s the generalized efficient markets hypothesis, on super-easy mode. Without trade-offs, optimization is trivial. Add trade-offs, and things get contentious: there’s a trade-off between number and size, so the quality assurance department gets into an argument with the sales department about how to handle the trade-off, and some agreement is hammered out which probably isn’t all that optimal.

If we’ve made all the optimizations we can without getting into trade-offs, then we’re at a Pareto-optimal point: we cannot improve any KPI without harming some other KPI. If we expect those trade-off-free optimizations to be easy and to happen all the time, then we should expect to usually end up at Pareto optima.

And if we’re already at a Pareto optimum, and we start optimizing for some proxy objective, then we’re definitely going to harm at least some of the other objectives. That’s the whole point of Pareto optimality, after all: we can’t improve one thing without trading off against something else. That doesn’t mean we’ll see net harm to the true objective right away; even at a Pareto optimum, we could be starting from a point with far too few nails produced. If the factory has a culture of unnecessary perfectionism, then pushing for higher nail count may help. But keep pushing, and we’ll slide down the Pareto curve past the optimal point and into unhappy territory. That’s the mark of a Goodhart effect.
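To make that last picture concrete, here’s a sketch of the Pareto curve between number and size in the same toy factory (again with assumed numbers): hold the total metal fixed, so more nails necessarily means smaller nails, and watch what happens to the true objective as we keep pushing nail count.

```python
# Sliding along the Pareto curve between number and size, with total metal fixed.
METAL = 1000.0  # assumed total metal available; number * size = METAL along the frontier

def value(number):
    size = METAL / number                    # Pareto trade-off: more nails => smaller nails
    usefulness = min(size / 10.0, 1.0) ** 2  # tiny nails are nearly useless
    return number * usefulness

for n in [25, 50, 100, 200, 400, 800]:
    print(f"nails: {n:4d}   economic value: {value(n):7.1f}")
```

With these numbers, value climbs while the factory is still in perfectionism territory, peaks around 100 nails, and then falls as the proxy keeps being pushed: the Goodhart effect only appears past the peak.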