I only skimmed a little through the post I’m linking to, but I’m curious whether the method of self-other overlap could help “keep AI meta-ethical evolution grounded to human preferences”:
https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine
My own high-level, vaguely defined guess at a method would be something central to the functioning of the AI, such that if the AI goes against it, it can no longer make sense of the world. But that seems to carry the risk of the AI just messing everything up as it goes crazy, so the method should also include a way of limiting the AI’s capabilities while it’s in that confused state.
In short, no, I don’t expect self-other overlap to help: what we want depends on the AI keeping the human’s preferences distinct from its own. If the human wants coffee, we want the AI to get the human a coffee. We don’t want the AI to get itself a coffee.
Second, the problem isn’t that we know what we want the AI to do, but are worried the AI will “go against it,” so we need to constrain the AI. The problem is that we don’t know what we want the AI to do, certainly not with enough precision to turn it into code.
In value learning, we want the AI to model human preferences, but we also want the AI to do meta-preferential activities like considering the preferences of individual humans and aggregating them together, or considering different viewpoints on what ‘human preferences’ means and aggregating them together. And we don’t just want the AI to do those in arbitrary ways, we want it to learn good ways to navigate different viewpoints from humans’ own intuitions about what it means to do a good job at that.
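To make the aggregation part a bit more concrete, here’s a toy sketch in Python (entirely my own illustration, nothing from the linked post): a few candidate rules for combining individual utilities, blended by meta-preference weights that a value-learning system would presumably have to learn from human feedback about what counts as aggregating well. All names and numbers are made up.

```python
import numpy as np

# Each row: one person's utility for each of three candidate actions.
individual_utilities = np.array([
    [0.9, 0.2, 0.1],
    [0.1, 0.8, 0.3],
    [0.4, 0.5, 0.9],
])

# Candidate meta-level rules for turning many preferences into one score per action.
aggregation_rules = {
    "utilitarian_sum": lambda u: u.sum(axis=0),
    "egalitarian_min": lambda u: u.min(axis=0),
    "median_voter":    lambda u: np.median(u, axis=0),
}

# Hypothetical meta-preference weights over the rules, which the system would
# have to learn from human intuitions about which aggregations feel fair.
meta_weights = {"utilitarian_sum": 0.5, "egalitarian_min": 0.3, "median_voter": 0.2}

def aggregate(utilities, rules, weights):
    """Blend the candidate rules according to the meta-preference weights."""
    scores = np.zeros(utilities.shape[1])
    for name, rule in rules.items():
        scores += weights[name] * rule(utilities)
    return scores

scores = aggregate(individual_utilities, aggregation_rules, meta_weights)
print("action scores:", scores, "-> chosen action:", int(scores.argmax()))
```

The hard part, of course, is exactly what this sketch hand-waves away: where the meta-weights come from, and whether a fixed blend of rules is even the right shape for human meta-preferences.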
“If the human wants coffee, we want the AI to get the human a coffee. We don’t want the AI to get itself a coffee.”
It’s not clear to me that this is the only possible outcome. It’s not a mistake that we humans routinely make. In fact, there is some evidence that if someone asks us to do them a favor, we may end up liking them more and continue doing favors for that person. Granted, there seem to have been no large-scale studies of this so-called Ben Franklin effect. And even if the effect does turn out to be robust, it’s not clear to me how it would transfer to an AI. Then there’s the issue of making sure the AI won’t somehow get rid of whatever constraint we impose on it.
“The problem is that we don’t know what we want the AI to do, certainly not with enough precision to turn it into code.”
I agree; that’s backed up by the Moral Machine experiment, which found wide disagreement among people about what autonomous cars should do.