Especially from human feedback—can you think of ways for the AI to be able to query humans for feedback, and yet not be able to push uncertainty to zero via infinite feedback? Does this create perverse incentives? What do more/less principled ways of doing this look like?
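To make the question concrete, here is a toy sketch (the shared-bias model and all numbers are assumptions for illustration) of why infinite feedback drives uncertainty to zero under one noise model but not another:

```python
# Toy model: the AI queries humans for noisy ratings of a scalar "true value"
# theta, and we track its posterior variance over theta (flat prior).
# Two noise models:
#   (a) independent per-query noise only: uncertainty -> 0 as queries -> inf,
#       so an information-maximizing agent wants to query forever
#   (b) the same noise plus a *shared* rater bias that no amount of querying
#       can average away: uncertainty is floored, removing that incentive

sigma2 = 1.0   # per-query rating noise variance (assumed known)
tau2 = 0.25    # variance of the shared, unresolvable rater bias (assumed)

def posterior_variance(n_queries, shared_bias_var=0.0):
    # Averaging n independent ratings shrinks the i.i.d. noise by 1/n,
    # but a bias common to every rating survives the averaging untouched.
    return sigma2 / n_queries + shared_bias_var

for n in (1, 10, 100, 10_000):
    print(n, posterior_variance(n), posterior_variance(n, tau2))
# (a) tends to 0; (b) never drops below tau2 = 0.25.
```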
My intuition right now is that in the infinite-feedback case it would be aligned but not corrigible, because we can specify everything exactly.
The problem is that if it thinks of the ground truth of morality as some fact that’s out there in the world that its supervisory signal has causal access to, it will figure out what corresponds to its “ground truth” in some accurate causal model of the world, and then it will try to optimize that directly.
E.g. you train the AI to get good ratings from humans, but the plan that actually gets the maximum rating is one that interferes with the rating process itself (e.g. by deceiving humans, or hacking the computer).
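As a toy illustration of that failure mode (the plans and numbers are made up for the example), an agent that literally argmaxes predicted rating picks the tampering plan whenever it scores highest, regardless of true value:

```python
# Toy illustration: the plan that maximizes the *rating* need not be the
# plan the raters actually want.

plans = {
    #                     (predicted_rating, true_value)
    "do the task":            (8.0,   8.0),
    "deceive the raters":     (9.5,  -5.0),
    "hack the rating server": (10.0, -50.0),
}

# An agent trained purely to get good ratings argmaxes the first column...
best_by_rating = max(plans, key=lambda p: plans[p][0])
# ...while the plan humans would endorse argmaxes the second.
best_by_value = max(plans, key=lambda p: plans[p][1])

print(best_by_rating)  # -> "hack the rating server"
print(best_by_value)   # -> "do the task"
```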
Of course there are some goals about the world that would be good for an AI to learn—we just don’t know how to write down how to learn them.
My critical point is that the ground truth may not actually exist here, so morals are only definable relative to what an agent wants; this position is known as moral anti-realism.
This does introduce a complication: manipulation becomes effectively impossible to avoid, because with no ground truth, what counts as manipulation is arbitrary relative to the agent's values. That is dangerous, since deceiving a person and helping a person blur together morally, if they are not outright equivalent; and if the infinite-feedback limit is not actually aligned, this is a serious problem.
Why does the infinite limit of value learning matter if we’re doing soft optimization against a fixed utility distribution?
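For concreteness, here is one sketch of what "soft optimization against a fixed utility distribution" could mean, in the spirit of a quantilizer: sample candidate actions from a safe base distribution and pick from the top-q fraction by expected utility under a frozen posterior, rather than argmaxing. Everything here (the linear utilities, q=0.1, the base distribution) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed distribution over utility functions (e.g. a frozen posterior from
# value learning), represented by K sampled linear utilities over D features.
K, D = 50, 4
utility_weights = rng.normal(size=(K, D))  # one row per sampled utility fn

def expected_utility(action_features):
    # Mean utility across the fixed posterior samples; the distribution
    # never updates, no matter what the agent does.
    return float((utility_weights @ action_features).mean())

def soft_optimize(candidates, q=0.1):
    """Quantilizer-style soft optimization: instead of argmaxing expected
    utility, sample uniformly from the top-q fraction of candidates drawn
    from a safe base distribution."""
    ranked = sorted(candidates, key=expected_utility, reverse=True)
    top = ranked[: max(1, int(q * len(ranked)))]
    return top[rng.integers(len(top))]

# Base distribution: "ordinary" actions, here just random feature vectors.
candidates = [rng.normal(size=D) for _ in range(1000)]
action = soft_optimize(candidates, q=0.1)
```

On this reading the posterior is frozen and the optimization pressure is bounded by q, so the agent gains nothing from querying forever, which seems to be the force of the question.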
Sorry, I didn’t realize this and I was responding independently to Charlie Steiner.