Nice post. The one thing I’m confused about is:
Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured).
It seems to me that the type of research you’re discussing here is already seen as a standard way to make progress on the full alignment problem—e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you’re institutionally uncertain whether to prioritise it—is it because of the objections you outlined? But your responses to them seem persuasive to me—and more generally, the objections don’t seem to address the fact that a bunch of people who are trying to solve long-term alignment problems actually ended up doing this research. So I’d be interested to hear elaborations and defences of those objections from people who find them compelling.
We’re simply not sure where “proactively pushing to make more of this type of research happen” should rank relative to other ways we could spend our time and money right now, and determining that will involve thinking about a lot of things that are not covered in this post (most importantly what the other opportunities are for our time and money).
already seen as a standard way to make progress on the full alignment problem
It might be a standard way to make progress, but I don’t feel that this work has been the default so far — the other three types of research I laid out seem to have absorbed significantly more researcher-hours and dollars among people concerned with long-term AI risk reduction. (It’s possible that human feedback is more common among people motivated by profit, but I doubt that because it doesn’t seem that profitable yet.)
Also, if we use a stricter definition of “narrowly superhuman” (i.e. the model should be capable of outperforming the evaluations — not just the demonstrations — of the humans training it), I’d argue that there hasn’t been any work published on that so far.
It seems to me that the type of research you’re discussing here is already seen as a standard way to make progress on the full alignment problem—e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you’re institutionally uncertain whether to prioritise it—is it because of the objections you outlined?
It’s important to distinguish between:
1. “We (Open Phil) are not sure whether we want to actively push this in the world at large, e.g. by running a grant round and publicizing it to a bunch of ML people who may or may not be aligned with us”
2. “We (Open Phil) are not sure whether we would fund a person who seems smart, is generally aligned with us, and thinks that the best thing to do is reward modeling work”
My guess is that Ajeya means the former but you’re interpreting it as the latter, though I could easily be wrong about either of those claims.