This paper gives a mathematical model of when Goodharting will occur. To summarize: if

(1) a human has some collection of things s_1, …, s_n which she values,

(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and

(3) the robot can freely vary how much of each s_i there is in the world, subject only to resource constraints that make the s_i’s trade off against each other,

then when the robot optimizes for its proxy utility, it will minimize all of the s_i’s which its proxy utility function doesn’t take into account. If you impose a further condition which ensures that you can’t get too much utility by maximizing only a strict subset of the s_i’s (e.g. assuming diminishing marginal returns), then the optimum found by the robot will be suboptimal for the human’s true utility function.
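Concretely, here's a toy instance of (1)–(3) with my own illustrative numbers (not from the paper): the human values n things with diminishing (square-root) returns, the robot's proxy sees only the first k of them, and a shared budget makes them trade off.

```python
import numpy as np

# Toy instance of (1)-(3), with illustrative numbers of my own choosing.
# The human values n things with diminishing returns,
# U_true(s) = sum_i sqrt(s_i); the robot's proxy only sees the first k;
# a shared budget makes the things trade off: sum_i s_i <= budget.
n, k, budget = 4, 2, 100.0

def u_true(s):
    return np.sum(np.sqrt(s))

def u_proxy(s):
    return np.sum(np.sqrt(s[:k]))

# By symmetry and concavity, the proxy optimum splits the budget evenly
# over the k observed things (spending nothing on the rest), while the
# true optimum splits it evenly over all n.
s_proxy_opt = np.array([budget / k] * k + [0.0] * (n - k))
s_true_opt = np.full(n, budget / n)

print(u_true(s_proxy_opt))  # sqrt(k * budget) ~= 14.14
print(u_true(s_true_opt))   # sqrt(n * budget) = 20.0
```

The unobserved things get zeroed out exactly as in the summary above, and the proxy optimum falls short of the true optimum by sqrt(n·budget) − sqrt(k·budget) in true utility.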

That said, I wasn’t super-impressed by this paper—the above is pretty obvious and the mathematical model doesn’t elucidate anything, IMO.

Moreover, I think this model doesn’t interact much with the skeptical take about whether Goodhart’s Law implies doom in practice. Namely, here are some things I believe about the world which this model doesn’t take into account:

(1) Lots of the things we value are correlated with each other over “realistically attainable” distributions of world states. Or in other words, for many pairs of things we care about, it is hard (concretely, requires a very capable AI) to increase the amount of one without also increasing the amount of the other.

(2) The utility functions of future AIs will be learned from humans in such a way that as the capabilities of AI systems increase, so will their ability to model human preferences.

If (1) is true, then for each given capabilities level, there is some room for error for our proxy utility functions (within which an agent at that capabilities level won’t be able to decouple our proxy utility function from our true utility function); this permissible error margin shrinks with increasing capabilities. If you buy (2), then you *might* additionally think that the actual error margin between learned proxy utility functions and our true utility function will shrink more rapidly than the permissible error margin as AI capabilities grow. (Whether or not you actually *do* believe that value learning will beat capabilities in this race probably depends on a whole lot of other empirical beliefs, or so it seems to me.)

This thread (which you might have already seen) has some good discussion about whether Goodharting will be a big problem in practice.

Hmm, I’m not sure I understand—it doesn’t seem to me like noisy observations ought to pose a big problem to control systems in general.

For example, suppose we want to minimize the number of mosquitos in the U.S., and we have access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending on counties that have fewer mosquitos than we think), but we’ll still be doing approximately the right thing, and mosquito counts will go down. In particular, I don’t see a sense in which the error “comes to dominate” the thing we’re optimizing.
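A minimal simulation of the mosquito example, with toy numbers of my own choosing: each round we observe noisy county-level counts, allocate a fixed spraying budget proportionally to the *observed* counts, and remove that many mosquitos.

```python
import numpy as np

# Toy sketch of noisy feedback control (my own numbers): noisy estimates
# misallocate effort within a round, but progress still points downhill.
rng = np.random.default_rng(1)

counts = rng.uniform(1e5, 1e6, size=10)  # true mosquitos per county
initial_total = counts.sum()
budget = 5e5                             # removal capacity per round

for _ in range(20):
    observed = counts * rng.lognormal(0.0, 0.3, size=10)  # noisy estimates
    if observed.sum() == 0:
        break  # everything already eradicated
    effort = budget * observed / observed.sum()   # may misallocate slightly
    counts = np.maximum(counts - effort, 0.0)     # effort past zero is wasted

print(initial_total, counts.sum())  # total falls steadily despite the noise
```

The noise only shuffles effort between counties within a round; it never flips the sign of progress, which is the sense in which the error stays a small inefficiency rather than dominating the objective.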

One concern which *does* make sense to me (and I’m not sure if I’m steelmanning your point or just saying something completely different) is that under extreme optimization pressure, measurements might become *decoupled* from the thing they’re supposed to measure. In the mosquito example, this would look like us bribing the surveyors to report artificially low mosquito counts instead of actually trying to affect real-world mosquito counts. If this is your primary concern regarding Goodhart’s Law, then I agree the model above doesn’t obviously capture it. I guess it’s more precisely a model of proxy misspecification.
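That decoupling can be sketched in a few lines (toy numbers of my own, not from the thread): the agent splits effort between actually killing mosquitos and bribing surveyors, and faking is cheaper per reported mosquito than killing.

```python
# Toy model of measurement corruption (my own construction): the agent
# splits one unit of effort between real mosquito removal and bribing
# surveyors. The reported count subtracts both real kills and fakes, and
# faking moves the reported number more cheaply than killing does.
def reported_count(bribe_fraction, true_count=1000.0,
                   kill_rate=100.0, fake_rate=500.0):
    killed = kill_rate * (1 - bribe_fraction)
    faked = fake_rate * bribe_fraction
    return true_count - killed - faked

def true_count_after(bribe_fraction, true_count=1000.0, kill_rate=100.0):
    return true_count - kill_rate * (1 - bribe_fraction)

# Minimizing the *reported* count pushes the agent to pure faking, which
# is the worst possible choice for the *true* count.
best = min((reported_count(b / 10), b / 10) for b in range(11))[1]
print(best)                    # 1.0 -- all effort goes into bribery
print(true_count_after(best))  # 1000.0 -- no real mosquitos removed
```

Unlike the noisy-measurement case above, here the error is chosen adversarially by the optimizer itself, so more optimization pressure makes the measurement *worse* rather than merely imprecise.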