I don’t want to drop “superhuman” from the name because that’s the main reason it feels like “practicing what we eventually want to do.”
One response I generated was, “maybe it’s just not so much about practicing what we eventually want to do, and that part is an illusion of the poor framing. We should figure out the right framing first and then ask whether it seems like practice, not optimize the framing to make it sound like practice.”
But I think my real response is: why is the superhuman part important, here? Maybe what’s really important is being able to get answers (e.g., medical advice) without putting them in (e.g., without fine-tuning on medical advice filtered for high quality), and asking for superhuman ability is just a way of helping ensure that? Or, more generally, maybe there are other things like this that you’d expect people to get wrong if they weren’t dealing with a superhuman case, given that you want the technology to eventually work for superhuman cases.
In my head the point of this proposal is very much about practicing what we eventually want to do, and seeing what comes out of that; I wasn’t trying here to make something different sound like it’s about practice. I don’t think that a framing which moved away from that would better get at the point I was making, though I totally think there could be other lines of empirical research under other framings that I’d be similarly excited about or maybe more excited about.
In my mind, the “better than evaluators” part is kind of self-evidently intriguing for the basic reason I said in the post (it’s not obvious how to do it, and it’s analogous to the broad, outside view conception of the long-run challenge which can be described in one sentence/phrase and isn’t strongly tied to a particular theoretical framing):
I’m excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques.
A lot of people responding to the draft were pushing in the direction that I think you were maybe gesturing at: making this more specific to “knowing everything the model knows” or “ascription universality.” The section “Why not focus on testing a long-term solution?” was written in response to Evan Hubinger and others on this point. I think I’m still not convinced that’s the right way to go.