Re: part 1 -
Good points, I agree. Though I think you could broadly replicate the summarization result using supervised learning—the hope for using supervised learning in superhuman domains is that your model learns a dimension of variation for “goodness” that can generalize well even if you condition on “goodness” being slightly outside any of the training examples.
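Concretely, here's a minimal sketch of the kind of conditioning I have in mind (the examples, the quality scale, and the control-token format are all made up for illustration; any conditional sequence model would do):

```python
# Sketch of "goodness"-conditioned supervised learning. Everything here is a
# stand-in: the examples, the 0-8 quality scale, and the <quality=...> prefix.

# Hypothetical training pairs: (summary, human quality rating on a 0-8 scale).
train = [
    ("terse but accurate summary", 7.5),
    ("rambling summary with several errors", 2.0),
    ("decent summary that misses one key point", 5.0),
]

def to_conditioned_example(text: str, rating: float) -> str:
    # Prefix the rating as a control token so the model learns
    # p(text | quality) rather than just p(text).
    return f"<quality={rating:.1f}> {text}"

train_strings = [to_conditioned_example(t, r) for t, r in train]
# ...fit any sequence model on train_strings with an ordinary supervised loss...

# At inference time, condition on a rating slightly above the best one seen in
# training (7.5 here) and sample. The hope is that "quality" was learned as a
# smooth dimension of variation, so the model extrapolates to outputs a bit
# better than any demonstration instead of falling apart.
prompt = f"<quality={8.0:.1f}> "
print(prompt)
```

Whether that extrapolation actually holds once you condition outside the training range is, of course, the load-bearing assumption.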
Re: part 2 -
What it boils down to is that my standards (and I think the practical standards) for medical advice are low, while my standards for moral advice are high (as in, high enough that you could use the result to align AGI). I agree that there's no magic property a moral question has that no medical question could have. But there are non-magical properties I expect to be relevant.
With medical advice from a text model, I'm not expecting it to learn a detailed model of the human body and infer new medical conditions and treatments that human experts haven't figured out yet. I'm just expecting it to do verbal reasoning that arrives at the same substantive advice a human expert would give, maybe packaged with a slightly superhumanly good explanation.
With moral advice, though, ask three human experts and you'll get four opinions. This is made worse by the fact that I've sneakily increased the size of the problem: "moral advice" can be about almost anything. Was it bad to pull the plug on Terri Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for COVID-19 vaccines?
Medical advice seems to be in the "supervisable regime," where the system has fulfilled its promise by merely telling us things that human experts already know. Moral advice very much is not, because humans aren't consistent about morality in the same way they can be about medicine.
If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think about training an algorithm that tries to learn from anti-abortion folks and output pro-choice responses? Suppose you then run that same algorithm on the experts and it gives outputs in favor of legalizing infanticide—are the humans allowed to say “hold on, I don’t want that,” or are we just going to accept that as what peak performance looks like? So anyhow I’m pessimistic about sandwiching for moral questions.
Getting better at eliciting human preferences does seem important, but again it has more wrinkles than it does for medicine. We have metapreferences (preferences about our preferences, or about how to resolve our own inconsistencies) that have few analogues in medicine. This immediately thrusts us into a domain beyond human capacity for direct evaluation. So I absolutely agree with you that we should be seeking out problems in this domain and trying to make progress on them. But I'm still pretty confident that we're missing some conceptual tools for doing well on these problems.