Ajeya Cotra comments on The case for aligning narrowly superhuman models

Ajeya Cotra 6 Mar 2021 16:51 UTC
LW: 12 AF: 6
We’re simply not sure where “proactively pushing to make more of this type of research happen” should rank relative to other ways we could spend our time and money right now, and determining that will involve thinking about a lot of things that are not covered in this post (most importantly what the other opportunities are for our time and money).

“already seen as a standard way to make progress on the full alignment problem”

It might be a standard way to make progress, but I don’t feel that this work has been the default so far — the other three types of research I laid out seem to have absorbed significantly more researcher-hours and dollars among people concerned with long-term AI risk reduction. (It’s possible that human feedback is more common among people motivated by profit, but I doubt that because it doesn’t seem that profitable yet.)

Also, if we use a stricter definition of “narrowly superhuman” (i.e. the model should be capable of outperforming the evaluations — not just the demonstrations — of the humans training it), I’d argue that there hasn’t been any work published on that so far.