It seems to me that the type of research you’re discussing here is already seen as a standard way to make progress on the full alignment problem—e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you’re institutionally uncertain whether to prioritise it—is it because of the objections you outlined?
It’s important to distinguish between:
1. “We (Open Phil) are not sure whether we want to actively push this in the world at large, e.g. by running a grant round and publicizing it to a bunch of ML people who may or may not be aligned with us”
2. “We (Open Phil) are not sure whether we would fund a person who seems smart, is generally aligned with us, and thinks that the best thing to do is reward modeling work”
My guess is that Ajeya means the former but you’re interpreting it as the latter, though I could easily be wrong about either of those claims.