Got it, that makes more sense. (When you said “methods work on toy domains” I interpreted “work” as a verb rather than a noun.)
But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings.
I think by far the biggest open question is “how do you provide the nonmyopic approval so that the model actually performs well”. I don’t think anyone has even attempted to tackle this so it’s hard to tell what you could learn about it, but I’d be surprised if there weren’t generalizable lessons to be learned.
I agree that there’s not much benefit in “methods work” if that is understood as “work on the algorithm / code that, given data + rewards / approvals, translates them into gradient updates”. I care a lot more about iterating on how to produce the data + rewards / approvals.
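To make the distinction concrete, here’s roughly the shape of training loop I have in mind; a minimal REINFORCE-style sketch with hypothetical names and stand-in interfaces, not anyone’s actual implementation. The optimizer plumbing at the bottom is the part I don’t think needs much iteration; `overseer_approval` is where the open question from above lives:

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Stand-in policy; any architecture works, the point is the loss below."""
    def __init__(self, n_obs=4, n_actions=2):
        super().__init__()
        self.net = nn.Linear(n_obs, n_actions)

    def dist(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def overseer_approval(obs, action):
    # The open question lives here: a foresight-based judgment of how good
    # this action is for the long run, produced without optimizing against
    # observed long-term outcomes. Stubbed out as zeros in this sketch.
    return torch.zeros(obs.shape[0])

def mona_update(policy, optimizer, obs, immediate_reward_fn, w=1.0):
    # `immediate_reward_fn` is an assumed hook returning per-step env reward.
    dist = policy.dist(obs)
    action = dist.sample()
    # Myopic objective: reward for *this step only*, no credit assignment
    # across steps; long-horizon information enters only via the approval.
    r = immediate_reward_fn(obs, action) + w * overseer_approval(obs, action)
    loss = -(r.detach() * dist.log_prob(action)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```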
My guess is that if debate did “work” to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to make models more useful) would notice and adopt it.
I’d weakly bet against this: I think there will be lots of fiddly design decisions that you need to get right to actually see the benefits, plus iterating on this is expensive and hard because it involves multiagent RL. (Certainly this is true of our current efforts; the question is just whether this will remain true in the future.)
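To gesture at where the fiddliness comes from: even the bare-bones protocol has a lot of knobs. A hypothetical sketch (none of these interfaces are real), with the design decisions I mean flagged in comments:

```python
def debate_rewards(question, debater_a, debater_b, judge, n_rounds=3):
    """Debate as a training signal: two models argue, a judge scores the
    transcript, and the debaters receive zero-sum rewards. All interfaces
    (`argue`, `prob_a_wins`) are hypothetical stand-ins."""
    transcript = [("question", question)]
    for _ in range(n_rounds):
        # Decisions: sequential turns (shown) vs simultaneous arguments;
        # who opens; whether debaters can concede early; argument length.
        transcript.append(("A", debater_a.argue(transcript)))
        transcript.append(("B", debater_b.argue(transcript)))
    # Decisions hiding here: judge prompt, whether the judge sees the full
    # transcript or a summary, judge calibration, tie-breaking.
    p_a = judge.prob_a_wins(transcript)
    # Zero-sum terminal reward; per-round shaped rewards are another knob,
    # and each debater's gradient depends on the other's moving policy,
    # which is what makes iterating on this multiagent RL problem expensive.
    return p_a - 0.5, 0.5 - p_a
```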
For example I am interested in [...] debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn’t show up in “feedback quality benchmarks” (e.g. RewardBench) but which increases risk (e.g. by making models more evil). But this sort of debate work is in the RFP.
I’m confused. This seems like the central example of work I’m talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you’re talking about as well.)
EDIT: And tbc this is the kind of thing I mean by “improving average-case feedback quality”. I now feel like I don’t know what you mean by “feedback quality”.
I’m confused. This seems like the central example of work I’m talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you’re talking about as well.)
My bad, I was a bit sloppy here. The debate-for-control stuff is in the RFP, but not the debate work targeting subtle reward hacks that don’t show up in feedback quality evals.
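(Concretely, by a “feedback quality eval” I mean something shaped like the sketch below, RewardBench-style, and the comment flags why subtle reward hacks are invisible to it. The reward-model interface is a hypothetical stand-in.)

```python
def pairwise_accuracy(reward_model, pairs):
    """pairs: iterable of (prompt, chosen_response, rejected_response).
    Measures how often the reward model prefers the labeled-better response."""
    correct = 0
    total = 0
    for prompt, chosen, rejected in pairs:
        if reward_model.score(prompt, chosen) > reward_model.score(prompt, rejected):
            correct += 1
        total += 1
    # A reward model can score highly here and still reward subtly bad
    # behavior: the pairs were not generated by a policy optimizing against
    # this reward model, so adversarial failure modes never show up.
    return correct / total
```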
I think we agree that there are some flavors of debate work that are exciting and not present in the RFP.