The point is that V and V’ are both hard to define. U is simple, but without a good definition for V, you won’t be able to get a good V’, and if you do have a good V, you can just optimize that directly.
It seems I didn’t articulate my point clearly. What I was saying is that V and V’ are equally hard to define, yet we all assume that true human values have a Goodhart problem (rather than a reverse Goodhart problem). This can’t be because of complexity (since the complexity is equal), nor because we are maximising a proxy (since both have the same proxy, U).
So there is something specific about (our knowledge of) human values which causes us to expect Goodhart problems rather than reverse Goodhart problems. It’s not too hard to think of plausible explanations (fragility of value can be re-expressed in terms of simple underlying variables to get results like this), but it does need explaining. And it might not always be valid (e.g. if we used different underlying variables, such as the smooth-mins of the ones we previously used, then fragility of value and Goodhart effects would be much weaker), so we may need to worry about them less in some circumstances.
Sorry, why are V and V’ equally hard to define? Like if V is “human flourishing” and U is GDP, then V’ is “twice GDP minus human flourishing”, which is more complicated than V. I guess you’re gonna say “Why not say that V is twice GDP minus human flourishing?” But my point is: for any particular set U, V, V’, you can’t claim that V and V’ are equally simple, and you can’t claim that V and V’ are equally correlated with U. Right?
Almost equally hard to define. You just need to define U, which, by assumption, is easy.
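To make the symmetry concrete, here is a minimal sketch, assuming V’ is defined as 2U − V as in the GDP example above. The function name and the numbers are made up purely for illustration; nothing here is a claim about real GDP or flourishing.

```python
def reverse_target(u, v):
    """Return V' = 2U - V, the mirror image of V around the proxy U."""
    return 2 * u - v


# Hypothetical values, chosen only for illustration.
u = 10.0  # proxy U (e.g. GDP)
v = 7.0   # target V (e.g. human flourishing); here the proxy overstates it by 3

v_prime = reverse_target(u, v)

goodhart_gap = u - v        # how far the proxy sits above V
reverse_gap = u - v_prime   # how far the proxy sits above V' (negative: V' beats U)

print(v_prime, goodhart_gap, reverse_gap)  # prints: 13.0 3.0 -3.0
```

Given U, each of V and V’ can be computed from the other with one subtraction, which is the sense in which they are almost equally hard to define. Whatever shortfall V has relative to U shows up as an equal windfall for V’.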