In short, no, I don’t expect self-other overlap to help. If the human wants coffee, we want the AI to get the human a coffee. We don’t want the AI to get itself a coffee.
Second, the problem isn’t that we know what we want the AI to do, but are worried the AI will “go against it,” so we need to constrain the AI. The problem is that we don’t know what we want the AI to do, certainly not with enough precision to turn it into code.
In value learning, we want the AI to model human preferences, but we also want the AI to do meta-preferential activities like considering the preferences of individual humans and aggregating them together, or considering different viewpoints on what ‘human preferences’ means and aggregating them together. And we don’t just want the AI to do those in arbitrary ways, we want it to learn good ways to navigate different viewpoints from humans’ own intuitions about what it means to do a good job at that.
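To make the point concrete that "aggregating them together" is itself a design choice the AI would need intuitions about, here is a toy sketch (names, actions, and utility numbers are all made up for illustration): two reasonable-sounding aggregation rules can recommend different actions for the very same set of individual preferences.

```python
# Toy illustration: "aggregate human preferences" is underspecified.
# Two people, three actions, made-up utilities in [0, 1].
utilities = {
    "alice": {"coffee": 1.0, "tea": 0.4, "nothing": 0.0},
    "bob":   {"coffee": 0.0, "tea": 0.5, "nothing": 0.1},
}
actions = ["coffee", "tea", "nothing"]

def aggregate_mean(action):
    # Utilitarian-flavored rule: average utility across people.
    vals = [utilities[person][action] for person in utilities]
    return sum(vals) / len(vals)

def aggregate_min(action):
    # Egalitarian-flavored rule (maximin): utility of the worst-off person.
    return min(utilities[person][action] for person in utilities)

best_by_mean = max(actions, key=aggregate_mean)
best_by_min = max(actions, key=aggregate_min)

print(best_by_mean)  # "coffee" — mean utilities: 0.5, 0.45, 0.05
print(best_by_min)   # "tea"    — min utilities:  0.0, 0.4,  0.0
```

Neither rule is "the" aggregation of human preferences; choosing between them (or among the many others) is exactly the kind of meta-preferential judgment we would want the AI to learn from humans rather than have fixed arbitrarily.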
“If the human wants coffee, we want the AI to get the human a coffee. We don’t want the AI to get itself a coffee.”

It’s not clear to me that this is the only possible outcome. This isn’t a mistake that we humans routinely make. In fact, there is some evidence that if someone asks us to do them a favor, we may end up liking them more and continue doing favors for that person. Granted, there seem to have been no large-scale studies of this so-called Ben Franklin effect. Even if the effect turns out to be robust, it’s not clear to me how it would transfer to an AI. And then there’s the issue of making sure the AI won’t somehow get rid of whatever constraint we imposed on it.
“The problem is that we don’t know what we want the AI to do, certainly not with enough precision to turn it into code.”
I agree; that’s backed up by the findings from the Moral Machine experiment about what people think autonomous cars should do.