Questions for people working or thinking of working in this area:
Is there a way to have an AI “understand” that the values it is learning are not terminal values, or even instrumental values, but “interim” values? That is, they are things that humans want, subject to the fact that the humans are still trying to figure out what their real values are, so the AI shouldn’t be too attached to those values. Maybe it’s possible to stretch the “utility function + mistakes” model to cover this, but it seems like it would be much better if there were a more natural/elegant way to model these “interim” values.
Relatedly, is there a way to apply value learning to the problem of metaphilosophy? In other words, can an AI, by observing humans trying to solve philosophical problems, learn how to solve such problems itself and eventually exceed human-level performance?
If the answer to the above question is “no” or “it’s too hard”, it may seem sufficient for an AI to just learn not to interfere with or manipulate a human’s philosophical and moral deliberations. This may be much easier, but if we’re headed towards a multi-polar world of AIs aligned to different users/owners, we also need our AIs to protect us against manipulation by AIs aligned to others. Such an AI would seemingly need to distinguish between attempts at manipulation and helpful (or at least good-faith) discussion; otherwise, how could we talk with anyone else in the world without risking AI manipulation? But being able to make such distinctions seems only a small step away from the ability to be actively helpful, so this problem doesn’t seem much easier than learning how to do philosophical reasoning. Still, it may be useful to consider it as a separate problem, just in case it turns out to be much easier.
Uncertainty over utility functions + a prior that there are systematic mistakes might be enough to handle this, but I agree that this problem seems hard and not yet tackled in the literature. I personally lean towards “expected explicit utility maximizers are the wrong framework to use”.
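To make the “uncertainty over utility functions + a prior over systematic mistakes” idea concrete, here is a minimal toy sketch. Everything in it (the two candidate utility functions, the Boltzmann-rationality model, the specific “bias toward the first option” mistake model, and all the numbers) is an illustrative assumption of mine, not something from the comment: the AI holds a posterior over candidate utility functions and interprets a human choice through a known systematic-error model.

```python
import math

# Two hypothetical candidate utility functions over options {"a", "b"}.
candidate_utils = {
    "U1": {"a": 1.0, "b": 0.0},
    "U2": {"a": 0.0, "b": 1.0},
}
posterior = {"U1": 0.5, "U2": 0.5}

def choice_prob(utils, choice, beta=2.0, first_option_bias=0.5):
    # Boltzmann-rational human with an assumed systematic mistake:
    # they overweight option "a" by a fixed bias term.
    biased = {k: v + (first_option_bias if k == "a" else 0.0)
              for k, v in utils.items()}
    z = sum(math.exp(beta * v) for v in biased.values())
    return math.exp(beta * biased[choice]) / z

def update(posterior, observed_choice):
    # Bayes' rule: P(U | choice) is proportional to P(choice | U) * P(U).
    unnorm = {u: p * choice_prob(candidate_utils[u], observed_choice)
              for u, p in posterior.items()}
    total = sum(unnorm.values())
    return {u: v / total for u, v in unnorm.items()}

# The human picks "b" despite the modeled bias toward "a" --
# because the mistake model is explicit, this is strong evidence for U2.
posterior = update(posterior, "b")
```

The point of the mistake model is that the same observation carries different evidential weight depending on which errors the human is assumed to make; choosing “b” against the bias updates the AI more sharply toward U2 than it would under a pure rationality assumption.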
What framework do you use?
I don’t know yet, but researchers have some preliminary thoughts, which I’m hoping to write about in the future. Also, I realized that what I actually meant to say is “expected explicit utility maximizers are the wrong framework to use”, not utility functions—I’ve edited the parent comment to reflect this. CIRL comes to mind as published work that’s moving in a direction away from “expected explicit utility maximizers”, even though it does involve a reward function: it models a human-robot system that together optimize some expected utility, but the robot itself is not maximizing an explicitly represented utility function.
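A toy sketch of the CIRL-flavored idea being described. The two-action setup, the perfectly-optimal-human assumption, and all numbers are my own illustrative simplifications, not the paper’s formalism: the robot never holds a fixed explicit utility function of its own; it keeps a belief over which reward function the human has, updates that belief from the human’s observed action, and then acts to maximize expected reward under the belief.

```python
# Two hypothetical reward functions the human might have,
# over actions {"left", "right"}.
rewards = {
    "theta_left":  {"left": 1.0, "right": 0.0},
    "theta_right": {"left": 0.0, "right": 1.0},
}
belief = {"theta_left": 0.5, "theta_right": 0.5}

def human_action(theta):
    # Simplifying assumption: the human acts optimally for their true reward.
    return max(rewards[theta], key=rewards[theta].get)

def observe(belief, action):
    # Bayes update: the human's action is evidence about theta.
    likelihood = {t: 1.0 if human_action(t) == action else 0.0 for t in belief}
    unnorm = {t: belief[t] * likelihood[t] for t in belief}
    total = sum(unnorm.values())
    return {t: v / total for t, v in unnorm.items()}

def robot_action(belief):
    # The robot maximizes *expected* reward under its current belief --
    # it is not maximizing any explicitly represented utility of its own.
    expected = {a: sum(belief[t] * rewards[t][a] for t in belief)
                for a in ["left", "right"]}
    return max(expected, key=expected.get)

belief = observe(belief, "right")  # the human demonstrates "right"
chosen = robot_action(belief)
```

The key structural feature is that the “utility” lives in the joint human-robot system: the robot’s policy is derived from its belief about the human’s reward, so it changes as the belief changes, rather than being read off a fixed internal utility function.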