Am I correct that uncertainty about humans’ “true values” (or, more naturalistically, the underdetermination of how to model humans as having values) isn’t actually an active ingredient in the toy problem?
I.e. you start the AI, and it knows it’s going to get some observations of humans, model those humans as having values, and then act to fulfill those values. But if it’s updateless, it already has a prior probability distribution over which values it would land on, so it will just take the prior expectation and maximize that, basically preventing value learning from taking place.
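To make the worry concrete, here’s a toy numerical sketch of what I mean (hypothesis names, payoffs, and likelihoods all made up, and simplified down to a single action choice): the prior-expectation maximizer picks one action up front and the observation never changes it, while the agent that updates first does respond to it.

```python
# Toy illustration of the worry (all numbers and names invented for this sketch).
# Two hypotheses about human values, two actions, and one observation that is
# strong evidence about which hypothesis is true.

prior = {"values_A": 0.5, "values_B": 0.5}           # prior over value hypotheses
payoff = {                                           # utility of each action under each hypothesis
    "values_A": {"act_1": 1.0, "act_2": 0.0},
    "values_B": {"act_1": 0.0, "act_2": 0.9},
}
likelihood = {                                       # P(observation | hypothesis)
    "values_A": {"obs_A": 0.9, "obs_B": 0.1},
    "values_B": {"obs_A": 0.1, "obs_B": 0.9},
}
actions = ["act_1", "act_2"]

def prior_expectation_action():
    # maximize prior-expected utility; the observation never enters the calculation
    return max(actions, key=lambda a: sum(prior[h] * payoff[h][a] for h in prior))

def posterior_expectation_action(obs):
    # update on the observation first, then maximize posterior-expected utility
    unnorm = {h: prior[h] * likelihood[h][obs] for h in prior}
    z = sum(unnorm.values())
    posterior = {h: p / z for h, p in unnorm.items()}
    return max(actions, key=lambda a: sum(posterior[h] * payoff[h][a] for h in posterior))

print(prior_expectation_action())             # act_1, regardless of what is later observed
print(posterior_expectation_action("obs_B"))  # act_2 -- the observation actually mattered
```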
What do you think about the cheap fix, where we say “oops, that was a mistake: we gave the AI the preferences ‘globally maximize the modeled pattern from unknown data,’ when we should have given it the preferences ‘locally maximize the modeled pattern from unknown data,’ i.e. prefer that your outputs match the observed pattern, not that your outputs are globally right”?
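And here’s the same toy model under those “local” preferences, just to check my own intuition about the fix: because the utility is now defined relative to the value model fitted to whatever data actually arrives, even a prior (updateless) evaluation over observation-to-action policies favors the one that matches the observed pattern. (Same made-up numbers as above; “fitted hypothesis” here just means the MAP value model given the observation.)

```python
# Same toy numbers as the sketch above, but with the "local" preferences:
# score an output against the value model fitted to the observed data,
# not against the unknown true hypothesis.

prior = {"values_A": 0.5, "values_B": 0.5}
payoff = {
    "values_A": {"act_1": 1.0, "act_2": 0.0},
    "values_B": {"act_1": 0.0, "act_2": 0.9},
}
likelihood = {
    "values_A": {"obs_A": 0.9, "obs_B": 0.1},
    "values_B": {"obs_A": 0.1, "obs_B": 0.9},
}
observations = ["obs_A", "obs_B"]

def fitted_hypothesis(obs):
    # the value model the AI would land on after seeing obs (MAP under the prior)
    return max(prior, key=lambda h: prior[h] * likelihood[h][obs])

def local_utility(policy):
    # prior expectation (i.e. evaluated before seeing anything) of how well the
    # policy's output scores under the value model fitted to that observation
    return sum(
        prior[h] * likelihood[h][obs] * payoff[fitted_hypothesis(obs)][policy(obs)]
        for h in prior
        for obs in observations
    )

constant_policy = lambda obs: "act_1"                                   # ignores the data
responsive_policy = lambda obs: "act_1" if obs == "obs_A" else "act_2"  # matches the fitted pattern

print(local_utility(constant_policy))    # 0.5
print(local_utility(responsive_policy))  # 0.95 -- matching the observed pattern wins even under the prior
```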