Ajeya Cotra comments on The case for aligning narrowly superhuman models

Ajeya Cotra 6 Mar 2021 17:02 UTC
LW: 1 AF: 1
AF

My biggest concern is actually that the problem is going to be too easy for supervised learning. Need GPT-3 to dispense expert medical advice? Fine-tune it on a corpus of expert medical advice! Or for slightly more sophistication, fine-tune it to predict advice plus a score for how good the advice was, then condition on the score being high!

I don’t think you can get away with supervised learning if you’re holding yourself to the standard of finding fuzzy tasks where the model is narrowly superhuman. E.g. the Stiennon et al., 2020 paper involved using RL from human feedback: roughly speaking, that’s how it was possible for the model to actually improve upon humans rather than simply imitating them. And I think in some cases, the model will be capable of doing better than (some) humans’ evaluations, meaning that to “get models to do the best they can to help us” we will probably need to do things like decomposition, training models to explain their decisions, tricks to amplify or de-noise human feedback, etc.

There’s also some unavoidable conceptual progress needed (You can fine-tune GPT-3 for medical advice with little philosophical worry, but how do you fine-tune GPT-3 for moral advice? Okay, now that you thought of the obvious answer, what’s wrong with it?)

I don’t agree that there’s obviously conceptual progress that’s necessary for moral advice which is not necessary for medical advice — I’d expect a whole class of tasks to require similar types of techniques, and if there’s a dividing line I don’t think it is going to be “whether it’s related to morality”, but “whether it’s difficult for the humans doing the evaluation to tell what’s going on.” To answer your question for both medical and moral advice, I’d say the obvious first thought is RL from human feedback, and the second thought I had to go beyond that is trying to figure out how to get less-capable humans to replicate the training signal produced by more-capable humans, without using any information/expertise from the latter to help the former (the “sandwiching” idea). I’m not sure if it’ll work out though.
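One way to make the sandwiching check concrete is to score how closely the less-capable group's training signal matches the more-capable group's, with the expert labels held out and used only for evaluation. The sketch below is illustrative rather than a description of any actual experiment: the function name `agreement_rate`, the binary preference labels, and the toy data are all assumptions.

```python
from typing import Sequence


def agreement_rate(candidate: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of items on which two sets of labels agree."""
    if len(candidate) != len(reference):
        raise ValueError("label lists must have the same length")
    if not candidate:
        raise ValueError("label lists must be non-empty")
    matches = sum(c == r for c, r in zip(candidate, reference))
    return matches / len(candidate)


# Hypothetical toy data: 1 means "output A is better", 0 means "output B is better".
# Expert labels are held out; they are never shown to the non-expert labelers.
expert_labels = [1, 0, 1, 1, 0, 1]

# Labels from less-capable humans, without and with some assistance technique
# (e.g. decomposition or model-written explanations). Both lists are made up.
unaided_labels = [1, 1, 0, 1, 0, 0]
assisted_labels = [1, 0, 1, 1, 0, 0]

baseline = agreement_rate(unaided_labels, expert_labels)        # 3 of 6 items match
with_technique = agreement_rate(assisted_labels, expert_labels)  # 5 of 6 items match

print(f"unaided agreement with experts:  {baseline:.2f}")
print(f"assisted agreement with experts: {with_technique:.2f}")
```

Under this reading, a technique counts as progress if it raises agreement with the held-out expert signal without any expert input reaching the labelers.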
- Charlie Steiner 6 Mar 2021 23:22 UTC
  LW: 2 AF: 1
  AF Parent
  Re: part 1 -
  
  Good points, I agree. Though I think you could broadly replicate the summarization result using supervised learning—the hope for using supervised learning in superhuman domains is that your model learns a dimension of variation for “goodness” that can generalize well even if you condition on “goodness” being slightly outside any of the training examples.
  
  Re: part 2 -
  
  What it boils down to is that my standards (and I think the practical standards) for medical advice are low, while my standards for moral advice are high (as in, you could use this to align AGI). I agree that there’s no magic property a moral question has that no medical question could have. But there are non-magical properties I expect to be relevant.
  
  With medical advice from a text model, I’m not expecting it to learn a detailed model of the human body and be able to infer new medical conditions and treatments that human experts haven’t figured out yet. I’m just expecting it to do verbal reasoning to arrive at the same substantive advice a human expert would give, maybe packaged in a slightly superhumanly good explanation.
  
  With moral advice, though, ask 3 human experts and you’ll get 4 opinions. This is made worse by the fact that I’ve sneakily increased the size of the problem: “moral advice” can be about almost anything. Was it bad to pull the plug on Terri Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for COVID-19 vaccines?
  
  Medical advice seems to be in the “supervisable regime,” where it’s fulfilled its promise by merely telling us things that human experts know. Moral advice is very not, because humans aren’t consistent about morality in the same way they can be about medicine.
  
  If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think about training an algorithm that tries to learn from anti-abortion folks and output pro-choice responses? Suppose you then run that same algorithm on the experts and it gives outputs in favor of legalizing infanticide—are the humans allowed to say “hold on, I don’t want that,” or are we just going to accept that as what peak performance looks like? So anyhow I’m pessimistic about sandwiching for moral questions.
  
  Getting better at eliciting human preferences does seem important, but again it has more wrinkles than for medicine. We have metapreferences (preferences about our preferences, or about how to resolve our own inconsistencies) that have few analogues in medicine. This immediately thrusts us into the domain beyond human capacity for direct evaluation. So I absolutely agree with you that we should be seeking out problems in this domain and trying to make progress on them. But I’m still pretty confident that we’re missing some conceptual tools for doing well on these problems.
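
For readers who want the score-conditioning idea from this thread in concrete form, here is a minimal sketch of the data formatting it implies: prepend a quality score to each training example, then at inference time ask for a score at (or slightly above) the top of the training range. The `RatedExample` class, the `Score:/Question:/Answer:` template, and the 1 to 5 scale are hypothetical choices, and no particular fine-tuning API is assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RatedExample:
    prompt: str
    response: str
    score: int  # hypothetical human quality rating, 1 (poor) to 5 (excellent)


def to_training_text(example: RatedExample) -> str:
    """Format one rated example so the score appears before the response."""
    return (
        f"Score: {example.score}\n"
        f"Question: {example.prompt}\n"
        f"Answer: {example.response}"
    )


def conditioned_prompt(prompt: str, target_score: int) -> str:
    """Build an inference-time prompt that conditions on a chosen score."""
    return f"Score: {target_score}\nQuestion: {prompt}\nAnswer:"


# Made-up examples standing in for a rated corpus.
corpus: List[RatedExample] = [
    RatedExample("Is a mild fever dangerous?", "Usually not; rest and fluids help.", 4),
    RatedExample("Is a mild fever dangerous?", "Ignore it.", 1),
]

training_texts = [to_training_text(ex) for ex in corpus]
max_seen = max(ex.score for ex in corpus)

# Conditioning on the best score seen in training, and on one slightly above it.
# Whether a model generalizes sensibly to the second case is the open question.
in_range = conditioned_prompt("How much sleep do adults need?", max_seen)
out_of_range = conditioned_prompt("How much sleep do adults need?", max_seen + 1)

print(training_texts[0])
print(out_of_range)
```

The out-of-range prompt is the interesting one: the supervised approach only yields better-than-training outputs if the learned "goodness" dimension extrapolates past the highest score the model was trained on.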