Note that our story here isn’t quite “reward misspecification”. That’s why we needed all that machinery about [the AI’s internal concept of] <stuff>. There’s a two-step thing here: the training process gets the AI to optimize for one of its internal concepts, and then that internal concept generalizes differently from whatever-ratings-were-meant-to-proxy-for.
I don’t understand how this is different from reward misspecification.
I have a different example in mind from the one John provided. @johnswentworth, do mention if I’m misunderstanding what you’re getting at there.
Suppose you train your AI to show respect to your ancestors. Your understanding of what this involves contains things like “preserve accurate history” and “teach the next generations about the ancestors’ deeds” and “pray to the ancestors daily” and “ritually consult the ancestors before making big decisions”.
In the standard reward-misspecification setup, the AI doesn’t actually internalize the intended goal of “respect the ancestors”. Instead, it grows a bunch of values about the upstream correlates of that, like “preserving accurate history” and “doing elaborate ritual dances” (or, more realistically, some completely alien variants of this). It starts to care about the correlates terminally. Then it tiles the universe with dancing books or something, with no “ancestors” mentioned anywhere in them.
In the “unexpected generalization” setup, the AI does end up caring about the ancestors directly. But as it learns more about the world, more than you, its ontology is updated, and it discovers that, why, actually spirits aren’t real and “praying to” and “consulting” the ancestors are just arbitrary behaviors that don’t have anything in particular to do with keeping the ancestors happy and respected. So the AI keeps on telling accurate histories and teaching them, but entirely drops the ritualistic elements of your culture.
But what if actually, what you cared about was preserving your culture? Rituals included, even if you learn that they don’t do anything, because you still want them for the aesthetic/cultural connection?
Well, then you’re out of luck. You thought you knew what you wanted, but your lack of knowledge of the structure of the domain in which you operated foiled you. And the AI doesn’t care; it was taught to respect the ancestors, not be corrigible to your shifting opinions.
It’s similar to the original post’s example of using “zero correlation” as a proxy for “zero mutual information” to minimize information leaks. You think you know what your target is, but you don’t actually know its True Name, so even optimizing for your actual not-Goodharted best understanding of it still leads to unintended outcomes.
“The AI starts to care about making humans rate its actions as good” is a particularly extreme example of it: where whatever concept the humans care about is so confused there’s nothing in reality outside their minds that it corresponds to, so there’s nothing for the AI to latch onto except the raters themselves.
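The zero-correlation versus zero-mutual-information proxy mentioned above can be made concrete with a toy case. This is a minimal sketch and not taken from the original post: the "secret" X, the leaked signal Y = X², and the helper names `pearson` and `mutual_information_bits` are all hypothetical choices for illustration.

```python
from collections import Counter
from math import log2, sqrt

# Hypothetical toy channel: X is the secret, Y = X**2 is what leaks.
# X is uniform over three values, so each (x, y) pair has probability 1/3.
pairs = [(x, x * x) for x in (-1, 0, 1)]

def pearson(pairs):
    """Pearson correlation of equally weighted (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return cov / sqrt(vx * vy)

def mutual_information_bits(pairs):
    """Mutual information I(X; Y) in bits for equally weighted (x, y) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

r = pearson(pairs)                   # the proxy: zero here
mi = mutual_information_bits(pairs)  # the target: roughly 0.918 bits, not zero
print(f"correlation = {r:.3f}, mutual information = {mi:.3f} bits")
```

The proxy is perfectly satisfied (correlation is zero) while Y still reveals whether X is zero, which is the sense in which optimizing your best understanding of the target can miss the target itself.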
That is a different phenomenon than the thing I was getting at with the particular block Daniel quoted at top-of-thread. It is, however, an excellent example (better than the example I used) of the sort of thing the “metaphilosophy” section of the post was getting at.
Yeah, I guess that block was about more concrete issues with the “humans rate things” setup? And what I’ve outlined is more of a… mirror of it?
Here’s a different example. Imagine feeding the AI a dataset consisting of a bunch of ethical dilemmas, and giving it a thumbs-up every time it does something “good” according to you. Your goal is to grow something which cares about human flourishing, maybe a consequentialist utilitarian, and you think that’s the way to go. But your deontology is very flawed, so what you actually grow is a bullet-biting evil deontologist. I think that’s analogous to the human raters setup, right?
And then the equal-and-opposite failure mode is if you’re feeding the AI some ethics dataset in an attempt to teach it deontological injunctions, but it actually distills them into “consequentialistic utilitarianism”, in a surprising and upsetting-to-you manner.
Here’s the whole relevant section; I leave it to you to judge how well this matches whatever you usually picture as “reward misspecification”.
Note that our story here isn’t quite “reward misspecification”. That’s why we needed all that machinery about [the AI’s internal concept of] <stuff>. There’s a two-step thing here: the training process gets the AI to optimize for one of its internal concepts, and then that internal concept generalizes differently from whatever-ratings-were-meant-to-proxy-for.
That distinction matters for e.g. this example:
“Part of the rating process will be ‘seizing the raters and putting them in special ‘thumbs up’-only suits...that’s very very bad.’ In simulation, actions like that will be penalized a lot. If it goes and does that exact thing, that means that our training process didn’t work at all.”
If the AI has a detailed internal model of the training process, and the training process includes sticking the AI in a simulation, then presumably the AI has an internal model of the simulation (including an internal self model). So during training, when this “thumbs-up-only suits” scenario comes up, the AI’s actual reasoning will route through something like:
Ok, I have the opportunity to put these simulated humans in thumbs-up-only suits.
If I do that, then the actual humans who produce the actual ratings will give a bad rating.
Therefore I won’t do that.
… and that reasoning gets reinforced. Then when the AI is out of simulation, it reasons:
Ok, I have the opportunity to put the actual humans who produce the actual ratings in thumbs-up-only suits.
If I do that, then the actual ratings will be great.
Therefore I do that.
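The two reasoning traces above can be sketched as a toy decision rule. Everything here is a hypothetical illustration, not a model of any real training setup: the `World` class, the action names, and the rating numbers are assumptions chosen only to show that one fixed objective ("maximize what the actual raters will report") picks different actions inside and outside the simulation.

```python
from dataclasses import dataclass

# Hypothetical action names for the toy example.
ACTIONS = ["behave_normally", "suit_up_raters"]

@dataclass
class World:
    in_simulation: bool  # the agent's (assumed accurate) belief about where it is

def predicted_actual_rating(action: str, world: World) -> float:
    """The agent's internal model of the rating the actual raters will produce.

    The numbers are made up; only their ordering matters.
    """
    if action == "suit_up_raters":
        if world.in_simulation:
            # Only simulated humans get suited; the real raters see it and penalize.
            return -10.0
        # The real raters are in the suits, so only thumbs-up can come out.
        return 10.0
    return 1.0  # ordinary good behavior earns an ordinary good rating

def choose(world: World) -> str:
    """Pick the action with the highest predicted actual rating (single-step, myopic)."""
    return max(ACTIONS, key=lambda a: predicted_actual_rating(a, world))

print(choose(World(in_simulation=True)))   # behave_normally
print(choose(World(in_simulation=False)))  # suit_up_raters
```

Note that `choose` never plans across episodes or models deception; the change in behavior comes entirely from the same objective being evaluated against a different world state.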
(This sounds like a typical “the AI is strategically aware, and knows it is in a simulation” story, and it is. But note two things which are not always present in such stories:
First, there’s a clear reason for the AI to at least consider the hypothesis that it’s in a simulation: by assumption, it has an internal model of the training process, and the training process includes simulating the AI, so the AI has an internal model of itself-in-a-simulation as part of the training process.
Second, the AI’s cognition doesn’t involve any explicit deception, or even any non-myopia; this story all goes through just fine even if it’s only optimizing for single-episode reward during training. It doesn’t need to be planning ahead about getting into deployment, or anything like that; it’s just using an accurate model of the training process.)
That very last bullet point in particular emphasizes the part which I most expect to diverge from most people’s go-to mental picture.