Yeah, I guess that block was about more concrete issues with the “humans rate things” setup? And what I’ve outlined is more of a… mirror of it?
Here’s a different example. Imagine feeding the AI a dataset consisting of a bunch of ethical dilemmas, and thumbing it up every time it does something “good” according to you. Your goal is to grow something which cares about human flourishing, maybe a consequentialist utilitarian, and you think that’s the way to go. But the deontological intuitions actually driving your ratings are very flawed, so what you grow instead is a bullet-biting evil deontologist. I think that’s analogous to the human raters setup, right?
And then the equal-and-opposite failure mode is if you’re feeding the AI some ethics dataset in an attempt to teach it deontological injunctions, but it actually distills them into “consequentialist utilitarianism”, in a surprising and upsetting-to-you manner.