From a ‘real alignment’ perspective (how to get the AI to want to do good things and not bad things), I think there are some obvious implications for the future of RLAIF.
You might think of the label ‘RLAIF’ as standing in for the general strategy of leveraging unsupervised data about human behavior to point the AI towards human preferences, using a scaffold that solicits the AI’s predictions (or more general generative output, if the training isn’t for pure prediction) about human preference-laden behaviors, and then transforms those predictions into some sort of supervisory signal.
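A minimal sketch of that scaffold in Python might look like the following. Everything here is a hypothetical stand-in (the function names and the random “judgment” are illustrative, not any real library’s API); the point is just the shape of the loop, in which the model’s own judgments about preference-laden outputs become the training signal.

```python
# A toy sketch of the RLAIF scaffold described above, assuming stand-in
# functions rather than any real model or library API: the model's own
# preference judgments are recycled into a supervisory signal.
import random

def generate(model, prompt, n=2):
    # Stand-in: sample n candidate responses from the model.
    return [f"{prompt} :: candidate {i}" for i in range(n)]

def ai_feedback(model, prompt, candidates, principle):
    # Stand-in: ask the model which candidate better satisfies a stated
    # human-preference principle; here a random choice plays that role.
    return random.randrange(len(candidates))

def preference_update(model, prompt, chosen, rejected):
    # Stand-in: one preference-learning step (e.g. reward-model fitting
    # or a DPO-style update) on the AI-labeled pair.
    pass

def rlaif_step(model, prompts, principle):
    for prompt in prompts:
        candidates = generate(model, prompt)
        best = ai_feedback(model, prompt, candidates, principle)
        for i, candidate in enumerate(candidates):
            if i != best:
                preference_update(model, prompt, candidates[best], candidate)

rlaif_step("toy-model", ["Help me draft an apology email."],
           principle="Prefer the response a thoughtful human would endorse.")
```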
Similarly, the AZR setup leverages the AI’s unsupervised knowledge of code-quality-laden behaviors, using a scaffold that turns them back into a reward signal that lets the AI quote-unquote “train itself” to code better. Except that relative to vanilla RLAIF, there’s more of an emphasis on generating and solving specific problems that form a curriculum for the agent, rather than just responding well to samples from the training distribution. But now that I’ve described things in this way, you can probably see how to turn this back into RLAIF for alignment.
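For contrast, here is an equally hypothetical sketch of the AZR-style loop (again, stand-in functions, not the actual AZR code): the model proposes its own verifiable tasks, attempts them, and the scaffold’s code executor turns verification into reward. Swap the executor for AI preference judgments like those in the previous sketch and you are back at RLAIF for alignment, now with a self-generated curriculum.

```python
# A toy sketch of an AZR-style self-curriculum step, using stand-in
# functions: propose a verifiable coding task, attempt it, and let
# execution of the code supply the reward signal.

def propose_task(model):
    # Stand-in: the model writes a small program and an input for it.
    return ("lambda x: x * 2", 3)

def solve_task(model, program_src, inp):
    # Stand-in: the model predicts the program's output without running it.
    return 6

def azr_step(model):
    program_src, inp = propose_task(model)
    prediction = solve_task(model, program_src, inp)
    ground_truth = eval(program_src)(inp)  # the scaffold executes the code
    reward = 1.0 if prediction == ground_truth else 0.0
    return reward  # stand-in for: update both the proposer and the solver

print(azr_step("toy-model"))  # 1.0
```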
The overarching problem is, as usual, that we don’t understand how to do alignment in a non-hacky way.
We don’t know what sorts of moral reflection are necessary for good outcomes, and we don’t know where human feedback is a necessary ingredient to keep AI meta-ethical evolution grounded to human preferences. But hey, if we try various value learning schemes empirically maybe we’ll learn some things.
I only skimmed through the post I’m linking to, but I’m curious whether the method of self-other overlap could help “keep AI meta-ethical evolution grounded to human preferences”:
https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine
My own high-level, vaguely defined guess at a method would be something so central to the functioning of the AI that, if the AI goes against it, it can no longer make sense of the world. But that seems to carry the risk of the AI just messing everything up as it goes crazy, so the method should also include a way of limiting the AI’s capabilities while it’s in that confused state.
In short, no, I don’t expect self-other overlap to help. If the human wants coffee, we want the AI to get the human a coffee. We don’t want the AI to get itself a coffee.
Second, the problem isn’t that we know what we want the AI to do, but are worried the AI will “go against it,” so we need to constrain the AI. The problem is that we don’t know what we want the AI to do, certainly not with enough precision to turn it into code.
In value learning, we want the AI to model human preferences, but we also want the AI to do meta-preferential activities like considering the preferences of individual humans and aggregating them together, or considering different viewpoints on what ‘human preferences’ means and aggregating them together. And we don’t just want the AI to do those in arbitrary ways, we want it to learn good ways to navigate different viewpoints from humans’ own intuitions about what it means to do a good job at that.
“If the human wants coffee, we want the AI to get the human a coffee. We don’t want the AI to get itself a coffee.”
It’s not clear to me that this is the only possible outcome. It’s not a mistake that we humans routinely make. In fact, there is some evidence that if someone asks us to do them a favor, we may end up liking them more and continue doing favors for that person. Granted, there seem to have been no large-scale studies of this so-called Ben Franklin effect. Even if the effect does turn out to be robust, it’s not clear to me how it would transfer to an AI. And then there’s the issue of making sure the AI won’t somehow get rid of this constraint that we imposed on it.
“The problem is that we don’t know what we want the AI to do, certainly not with enough precision to turn it into code.”
I agree; that’s backed up by the findings from the Moral Machine experiment about what we think autonomous cars should do.