jimmy comments on The reverse Goodhart problem

jimmy 9 Jun 2021 21:23 UTC
5 points

This is one of the times it helps to visualize things to see what’s going on.

Let’s pick target shooting as an example, since it’s easy to picture and makes for a good metaphor. The goal is to get as close as possible to the bullseye, and for each inch of miss you score one less point. Visually, you see a group of concentric “rings” around the bullseye which score fewer and fewer points as they get bigger. Simplifying to one dimension for a moment, V = -abs(x).
However, it’s not easy to point the rifle right at the bullseye. You do your best, of course, and it’s much, much closer to the bullseye than any random orientation would be, but maybe you end up aiming one inch to the right, so that the more accurate your ammo is, the closer you get to this aimpoint of x=1. This makes U = -abs(1-x), or -abs(1-x)+constant or whatever. It doesn’t really matter, but if we pick -abs(1-x)+1, then U = V whenever you miss to the left of the bullseye (x ≤ 0), so it fits nicely with your picture.
When we plot U, V, and V’ = 2U-V, we can see that your mathematical truth holds, and it looks immediately suspicious. Going back to two dimensions, instead of having nice concentric rings around the actual target, you’re pointing out that if the bullseye had instead been placed exactly where you ended up aiming, and if the rings were distorted and non-concentric in this certain way, then V’ would actually increase at least as fast as U (in the one-dimensional picture, three times as fast between the real bullseye and your aimpoint).
But it’s sorta missing the point. For one, the absolute scaling is fairly meaningless in the first place, because it brings you towards the same place anyway. More importantly, you don’t get the luxury of drawing your bullseye after you shoot. If you had been aiming for V’ in the first place, you almost certainly wouldn’t have managed to pull off a proxy as perfect as U. (In general V’ and U don’t have to line up in the exact same spot like this, but in those cases you still wouldn’t have happened to miss V’ in this particular way.)
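
For anyone who would rather check the numbers than squint at a plot, here is a minimal Python sketch of the one-dimensional picture. The function names and the sample points are just illustrative choices on my part; the formulas are the ones above, with the +1 constant picked for U.

```python
def V(x):
    """True score: lose one point per inch of miss from the bullseye at x = 0."""
    return -abs(x)

def U(x):
    """Proxy score: peaks at the aimpoint x = 1; the +1 makes U equal V for x <= 0."""
    return -abs(1 - x) + 1

def V_prime(x):
    """The 'reverse Goodhart' target V' = 2U - V."""
    return 2 * U(x) - V(x)

# Hypothetical sample of shot positions, in inches right of the bullseye.
xs = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0]

for x in xs:
    print(f"x={x:5.1f}  V={V(x):5.1f}  U={U(x):5.1f}  2U-V={V_prime(x):5.1f}")

# Where does each score peak among the sampled positions?
best_V = max(xs, key=V)
best_U = max(xs, key=U)
best_V_prime = max(xs, key=V_prime)
print(best_V, best_U, best_V_prime)
```

On this sample, V peaks at the real bullseye while both U and V’ peak at the aimpoint, which is the “bullseye drawn after the shot” situation described above.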

Goodhart has nothing to do with human values being “funny”; it has to do with the fundamental difficulty of setting your sights in just the right place. Once you’re within the range of the distance between your proxy and actual goal, it’s no longer guaranteed that getting closer to the proxy gets you closer to your goal. It can actually bring you further away, and if it brings you further away, that’s bad. If you did a good job on all axes, maybe you end up hitting the 9 ring and that’s good enough.
The thing that makes it “inevitable disaster” rather than just “suboptimal improvement” is when you forget to take into account a whole dimension. Say you aim your rifle well in azimuth and elevation, but instead of telling the bullet to stop at a certain distance, you tell it to keep going in that direction forever, and it manages to succeed well beyond the target range.
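
The forgotten-dimension case can be sketched the same way. Everything specific here is an assumption made up for illustration: a target 100 units downrange, a 0.01 radian aiming error, and a proxy that simply rewards distance travelled along the aim line.

```python
import math

TARGET = (0.0, 100.0)  # (lateral offset, downrange distance); hypothetical units
AIM_ERROR = 0.01       # assumed small azimuth error, in radians

def position(travel):
    """Bullet position after `travel` units along the slightly-off aim line."""
    return (travel * math.sin(AIM_ERROR), travel * math.cos(AIM_ERROR))

def true_score(travel):
    """V: negative straight-line distance from the bullet to the target."""
    x, d = position(travel)
    return -math.hypot(x - TARGET[0], d - TARGET[1])

def proxy_score(travel):
    """U: rewards progress along the aim line and has no notion of range."""
    return float(travel)

# The proxy keeps climbing; the true score peaks near the target range and then falls.
for t in [0, 50, 100, 200, 1_000, 10_000]:
    print(f"travel={t:>6}  proxy={proxy_score(t):>8.1f}  true={true_score(t):>10.2f}")
```

With good aim, the best you can do along that line is a miss of about one unit at the target range. Past that point every bit of extra “success” on the proxy makes the true score worse, with no floor.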