Thanks for writing this! It always felt like a blind spot to me that we only have Goodhart’s law, which says “if X is a proxy for Y and you optimize X, the correlation breaks”, when we really mean a stronger version: “if you optimize X, Y will actively decrease”. Your paper clarifies that what we actually mean is an intermediate version: “if you optimize X, it becomes harder to optimize Y”. My conclusion would then be that the intermediate version is true but the strong version is false. Would you say that’s an accurate summary?
I’m not totally sure I fully follow the conclusion, but I’ll take a shot at answering—correct me if it seems like I’m talking past you.
Taking Y to be some notion of human values, I think it’s both true that Y actively decreases and Y becomes harder for us to optimize. Both of these are caused, I think, by the agent’s drive to take power / resources from us. If this weren’t true, we might expect to see only “evil” objectives inducing catastrophically bad outcomes.
I should’ve specified that the strong version is “Y decreases relative to a world where neither X nor Y is being optimized”. Am I right that this version is not true?
I don’t immediately see why this wouldn’t be true as well as the “intermediate version”. Can you expand?
If X is “number of paperclips” and Y is something arbitrary that nobody optimizes, such as the ratio of the number of bicycles on the moon to the number of flying horses, optimizing X should be equally likely to increase or decrease Y. Otherwise “1-Y” would systematically move in the opposite direction, and since “1-Y” is just as arbitrary as Y, that can’t be true by symmetry. But if Y is something like “number of happy people”, Y will probably decrease, because the world is already set up to keep Y up and a misaligned agent could disturb that state.
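A rough way to see the symmetry argument is a toy simulation. Everything in this sketch is an assumption made for illustration: X and Y are modelled as linear functions of a small world-state vector with independent random weights, and "optimizing X" is a single step along X's gradient. It only covers the "arbitrary Y" case, not a Y such as "number of happy people" that the world is already arranged to keep high.

```python
import random


def delta_y_when_optimizing_x(dim, rng):
    """Change in Y after one gradient step on X (hypothetical linear toy model)."""
    # X and Y are linear in the world state, with independent random weights,
    # so Y is "arbitrary" relative to X.
    w_x = [rng.gauss(0, 1) for _ in range(dim)]
    w_y = [rng.gauss(0, 1) for _ in range(dim)]
    # A gradient-ascent step on X moves the state along w_x, so Y changes
    # in proportion to the dot product of the two weight vectors.
    return sum(a * b for a, b in zip(w_x, w_y))


def fraction_y_increases(trials=10_000, dim=5, seed=0):
    """Fraction of random (X, Y) pairs for which optimizing X raises Y."""
    rng = random.Random(seed)
    ups = sum(delta_y_when_optimizing_x(dim, rng) > 0 for _ in range(trials))
    return ups / trials


if __name__ == "__main__":
    print(fraction_y_increases())
```

In this model the change in Y is symmetric around zero, so the printed fraction should sit close to one half: optimizing X pushes an arbitrary Y up about as often as down.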
That makes sense, thanks. I then agree that it isn’t always true that Y actively decreases, but it should generally become harder for us to optimize. This is the difference between a utility decrease and an attainable utility decrease.
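The distinction can be made concrete with a small sketch. The world below is hypothetical and not taken from the paper: a state is a tuple of paperclips, remaining resources, and happy people, and each unit of resource can be spent on either a paperclip or a person. "Attainable Y" is taken here to mean the best Y still reachable from a state.

```python
def step(state, action):
    """Apply one action in the toy world; each productive action costs one resource."""
    clips, resources, happy = state
    if resources == 0:
        return state  # nothing left to spend
    if action == "make_clip":
        return (clips + 1, resources - 1, happy)
    if action == "help_person":
        return (clips, resources - 1, happy + 1)
    return state  # any other action is treated as waiting


def current_y(state):
    """Y as it stands right now: the number of happy people."""
    return state[2]


def attainable_y(state):
    """Best Y still reachable: spend every remaining resource on people."""
    _, resources, happy = state
    return happy + resources


def run_paperclip_agent(start, steps=3):
    """An X-optimizer that spends resources on paperclips only."""
    state = start
    for _ in range(steps):
        state = step(state, "make_clip")
    return state


start = (0, 3, 10)
after = run_paperclip_agent(start)

if __name__ == "__main__":
    print("current Y:   ", current_y(start), "->", current_y(after))
    print("attainable Y:", attainable_y(start), "->", attainable_y(after))
```

Here the paperclip agent never touches the existing happy people, so Y itself does not drop, but the resources that could have raised Y are gone, so the best Y we can still reach does.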