So a highly competent preference helps predict the human’s preferences across scenarios. But I’m confused about how “violating one-sided competent preferences” makes sense with Goodhart’s law.
As an example, “Prefer 2 bananas over 1” can be very competent if it correctly predicts the preference in a wide range of scenarios (e.g. different parts of the day, after anti-banana propaganda, etc.), with incompetent meaning its prediction is wrong (max entropy, or the opposite of correct?). Assuming it’s competent, what does violating this preference mean? That the AI predicted 1 banana over 2, or that the simple rule “Prefers 2 over 1” didn’t actually apply?
By “violate a preference,” I mean that the preference doesn’t get satisfied: if the human competently prefers 2 bananas but only gets 1 banana, their preference has been violated.
But maybe you mean something along the lines of “If competent preferences are really broadly predictive, then wouldn’t it be even more predictive to infer the preference ‘the human prefers 2 bananas except when the AI gives them 1’, since that would more accurately predict how many bananas the human gets? This would sort of paint us into a corner where it’s hard to violate competent preferences as defined.”
My response would be that competence is based on how predictive and efficient the model is (just to reiterate, preferences live inside a model of the world), not how often you get what you want. Even if you never get 2 bananas and have only gotten 1 banana your entire life, a model that predicts that you want 2 bananas can still be competent if the hypothesis that you want 2 bananas helps explain how you’ve reacted to your life as a 1-banana-getter.
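As a toy sketch of that point (the scenarios, recorded reactions, and hypothesis functions below are all invented for illustration, not anything from the original post), you could score a preference hypothesis by how well it predicts the person’s observed reactions, and notice that the “wants 2 bananas” hypothesis scores well even though the preference is never satisfied:

```python
# Toy sketch: "competence" as predictive accuracy of a preference hypothesis,
# not as how often the preference is satisfied. All scenarios, reactions, and
# hypothesis functions here are invented for illustration.

# The person has only ever been given 1 banana; we record how they reacted.
observations = [
    ("given 1 banana at breakfast", "protests"),
    ("given 1 banana after anti-banana propaganda", "protests"),
    ("given 1 banana at dinner", "protests"),
]

def wants_two_bananas(scenario: str) -> str:
    """Hypothesis A: the person prefers 2 bananas over 1,
    so getting only 1 should produce a protest."""
    return "protests"

def content_with_one(scenario: str) -> str:
    """Hypothesis B: the person is satisfied with 1 banana,
    so they should just accept it."""
    return "accepts"

def competence(hypothesis, data) -> float:
    """Fraction of observed reactions the hypothesis predicts correctly.
    (A fuller notion would also reward efficiency, i.e. simpler models.)"""
    return sum(hypothesis(s) == reaction for s, reaction in data) / len(data)

print(competence(wants_two_bananas, observations))  # 1.0 -> competent
print(competence(content_with_one, observations))   # 0.0 -> incompetent
```

The thing being scored is prediction accuracy over observed behavior, not the rate at which the preference gets satisfied, which is why a never-satisfied preference can still be highly competent.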