Ah cool, I see: your concern is that RLHF might be better left to the capabilities people, freeing up AI safety researchers to work on more neglected approaches.
That seems right to me, and I agree with it as a general heuristic! Some caveats:
I’m a random person who’s been learning a lot about this stuff lately, definitely not an active researcher. So my opinions about heuristics for what to work on probably aren’t worth much.
If you think RLHF research could be very impactful for alignment, that could make up for it being less neglected than other areas.
Distinctive approaches to RLHF (like Redwood’s attempts to get their reward model’s error extremely low) might be the sorts of things that capabilities people wouldn’t try.
Finally, as a historical note: it’s hard to believe that a decade ago the state of alignment was like “holy shit, how could we possibly hard-code human values into a reward function? This is gonna be impossible.” The fact that now we’re like “obviously the big AI labs will, by default, build their AGIs with something like RLHF” is progress! And Paul’s comment elsethread is heartwarming: it implies that AI safety researchers helped accelerate the adoption of this safer-looking paradigm. In other words, if you believe RLHF improves our odds, then, contra some recent pessimistic takes, you believe that progress is being made :)