By “violate a preference,” I mean that the preference doesn’t get satisfied—so if the human competently prefers 2 bananas but only got 1 banana, their preference has been violated.
But maybe you mean something along the lines of “If competent preferences are really broadly predictive, then wouldn’t it be even more predictive to infer the preference ‘the human prefers 2 bananas except when the AI gives them 1’, since that would more accurately predict how many bananas the human gets? This would sort of paint us into a corner where it’s hard to violate competent preferences as defined.”
My response would be that competence is based on how predictive and efficient the model is (just to reiterate, preferences live inside a model of the world), not on how often you get what you want. Even if you have only ever gotten 1 banana and never 2, a model that says you want 2 bananas can still be competent, as long as the hypothesis that you want 2 bananas helps explain how you’ve reacted to your life as a 1-banana-getter.
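
To make that concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the observation log, the assumption that someone asks for more bananas whenever they got fewer than they prefer, and the 10% noise rate are all hypothetical, and it only captures the "predictive" half of competence, not the "efficient" half. It scores each preference hypothesis by how well it predicts the human's reactions, and separately reports how often that preference was satisfied.

```python
import math

# Hypothetical log: the human got 1 banana every day for 10 days,
# and asked for more on 9 of those days.
observations = [{"bananas": 1, "asked_for_more": day != 3} for day in range(10)]


def log_likelihood(preferred, observations, noise=0.1):
    """Score the hypothesis 'the human prefers `preferred` bananas'.

    Assumed reaction model: the human asks for more exactly when they
    got fewer bananas than they prefer, with probability `noise` of
    doing the opposite.
    """
    total = 0.0
    for obs in observations:
        predicted = obs["bananas"] < preferred
        matches = predicted == obs["asked_for_more"]
        total += math.log(1 - noise if matches else noise)
    return total


def satisfaction_rate(preferred, observations):
    """Fraction of days on which the preference was actually satisfied."""
    satisfied = sum(obs["bananas"] >= preferred for obs in observations)
    return satisfied / len(observations)


for preferred in (1, 2):
    print(
        f"prefers {preferred}: "
        f"log-likelihood = {log_likelihood(preferred, observations):.2f}, "
        f"satisfied on {satisfaction_rate(preferred, observations):.0%} of days"
    )
```

In this toy setup the "prefers 2" hypothesis is never satisfied, yet it explains the reactions much better than "prefers 1", which is always satisfied. How well the hypothesis predicts behavior and how often the preference gets met are two different measurements.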