That is a different phenomenon than the thing I was getting at with the particular block Daniel quoted at top-of-thread. It is, however, an excellent example (better than the example I used) of the sort of thing the “metaphilosophy” section of the post was getting at.
Yeah, I guess that block was about more concrete issues with the “humans rate things” setup? And what I’ve outlined is more of a… mirror of it?
Here’s a different example. Imagine feeding the AI a dataset of ethical dilemmas, and thumbing it up every time it does something “good” according to you. Your goal is to grow something that cares about human flourishing, maybe a consequentialist utilitarian, and you think that’s the way to go. But your ratings actually reflect a very flawed deontology, so what you grow is a bullet-biting evil deontologist. I think that’s analogous to the human raters setup, right?
And then the equal-and-opposite failure mode is when you’re feeding the AI some ethics dataset in an attempt to teach it deontological injunctions, but it instead distills them into “consequentialist utilitarianism”, in a surprising and upsetting-to-you manner.