Trying to disambiguate different questions about whether RLHF is “good”

(A few of the words in this post were written by Ryan Greenblatt and Ajeya Cotra. Thanks to Oliver Habryka and Max Nadeau for particularly helpful comments.)

Sometimes people want to talk about whether RLHF is “a promising alignment strategy”, or whether it “won’t work” or “is just capabilities research”. I think that conversations on these topics are pretty muddled and equivocate between a bunch of different questions. In this doc, I’ll attempt to distinguish some of these questions, and as a bonus, I’ll give my opinions on them.

I wrote this post kind of quickly, and I didn’t have time to justify all the claims I make; I hope that this post is net helpful anyway. I’m sympathetic to claims that alignment researchers should err more on the side of writing fewer but better posts; maybe I’ll regret making this one now instead of waiting.

  • Is “make a powerful AGI by using RLHF, where the feedback comes from unaided humans” a promising strategy for building aligned AGI?

    • IMO, this seems like the baseline, “we didn’t really try much at all” alignment scheme. Ajeya calls this a “naive safety effort” in her training game post, which lays out the basic case for pessimism about this strategy.

    • Here’s how I’d quickly summarize my problems with this scheme:

      • Oversight problems:

        • Overseer doesn’t know: In cases where your unaided humans don’t know whether the AI action is good or bad, they won’t be able to produce feedback which selects for AIs that do good things. This is unfortunate, because we wanted to be able to make AIs that do complicated things that have good outcomes.

        • Overseer is wrong: In cases where your unaided humans are actively wrong about whether the AI action is good or bad, their feedback will actively select for the AI to deceive the humans.

      • Catastrophe problems:

        • Even if the overseer’s feedback were perfect, a model whose strategy is to lie in wait until it has an opportunity to grab power will probably be able to successfully grab power.

    • I don’t think we’re 100% doomed if we follow this plan, but it does seem pretty likely to go badly.

    • RLHF with unaided humans is not literally the most doomed alignment scheme I’ve ever heard seriously proposed. For example, “train models with automated rewards (e.g. simulating evolution, or training models on a curriculum of math problems) and hope that the resulting models are aligned” might be a worse plan. (Though it’s pretty plausible to me that this kind of scheme would have such obvious alignment issues that people would quickly switch to the naive safety plan.)

    • I don’t think that many alignment researchers are seriously proposing this naive plan. Many researchers who work on RLHF are sympathetic to the concerns that I listed here. For example, OpenAI’s alignment plan emphasizes the importance of using models to assist human evaluation.

  • Is RLHF broadly construed (i.e. pretraining a model and then fine-tuning it based on some overseer’s evaluations of its actions) plausibly part of a not-completely-doomed alignment plan?

    • IMO, yes: we are very likely to want to make powerful models by fine-tuning them based on overseer feedback, because fine-tuning on preferences over a model’s outputs is a broadly applicable strategy that can be used in lots of ways (a toy sketch of this kind of training loop appears below).

    • I think that our best current alignment strategies (I basically agree with Holden here on the current best plan) involve fine-tuning a model based on feedback from human overseers who have AI and software assistance, on inputs including some that were chosen by a red team who is trying to find inputs where the model behaves very badly.

    • If prosaic alignment researchers are able to come up with alignment schemes which look good on paper (i.e., don’t admit obvious counterexamples), I think it’s pretty likely that these alignment schemes will use RL from preferences as a component.

    • So yes, I think that RLHF is very likely to be part of good alignment plans we construct. But it doesn’t seem obvious to me that you should call it an “alignment plan” itself; it’s just an extremely general technique you might use in your alignment plan. (In the same way, backprop will probably appear in our alignment plan, but I wouldn’t call it an alignment technique.)

    • I think it’s pretty plausible that if your goal were to build AGI as fast as possible and you didn’t care whether it was misaligned, you’d want to use RLHF as part of the training procedure; RLHF just seems pretty generally useful for making AIs that interact with the world.

    • (Am I being unreasonable by using “RLHF” to refer to RL on any kind of overseer feedback? Idk. I don’t really care about the terminology and I’m happy for someone else to make a proposal here. In practice, people seem to use RLHF to refer to RL fine-tuning of models based on some overseer’s evaluation of trajectories, regardless of whether the overseer is an unaided human or not.)
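
    • To make that concrete, here’s a deliberately tiny sketch of the loop I have in mind when I talk about RL fine-tuning based on an overseer’s evaluation of trajectories. Everything in it (the canned outputs, the overseer_score function, the plain REINFORCE update) is a hypothetical stand-in; real RLHF setups train a reward model on overseer comparisons of language-model completions and then optimize against it with something like PPO, but the shape of the selection pressure is the same.

      ```python
      # Toy sketch of "RL from overseer feedback": sample an output, have an
      # overseer score it, push the policy toward higher-scoring outputs.
      # The "policy" is just a softmax over three canned outputs and the
      # update is plain REINFORCE; this is an illustration, not a real setup.
      import numpy as np

      rng = np.random.default_rng(0)
      outputs = ["helpful answer", "evasive answer", "confident-sounding answer"]
      logits = np.zeros(len(outputs))  # policy parameters

      def overseer_score(output: str) -> float:
          # Hypothetical stand-in for the overseer's evaluation of a trajectory.
          # Note that the overseer can only rate what the output *looks* like.
          return {"helpful answer": 1.0,
                  "evasive answer": 0.2,
                  "confident-sounding answer": 0.8}[output]

      def softmax(x: np.ndarray) -> np.ndarray:
          z = np.exp(x - x.max())
          return z / z.sum()

      learning_rate = 0.5
      for _ in range(500):
          probs = softmax(logits)
          i = rng.choice(len(outputs), p=probs)
          reward = overseer_score(outputs[i])
          # REINFORCE: increase the log-probability of the sampled output in
          # proportion to the reward it received.
          grad = -probs
          grad[i] += 1.0
          logits += learning_rate * reward * grad

      # The policy concentrates on whatever the overseer scores highest,
      # whether or not that output is actually good.
      print({o: round(float(p), 2) for o, p in zip(outputs, softmax(logits))})
      ```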

  • Isn’t that broad class of plans, where you fine-tune a model based on some overseer’s evaluations of its actions, really scary, because such plans select for models that look aligned, so if you fail to align your model, it will act aligned anyway and its misalignment will be hard to detect?

    • Yep, I agree with this concern. (Ajeya calls this “playing the training game” in her post.) But I’m not really aware of any compelling alternatives to this class of plan: “training a model based on a reward signal” is basically all of machine learning, and so if you want an alignment strategy that’s competitive, I don’t see what else you can do.

    • I’m optimistic about various strategies that we can use to make it hard for models to be subtly misaligned in dangerous ways (e.g. the proposals mentioned in Holden’s doc here).

  • If we research RLHF more and then find improvements to it (e.g. we find ways to make the fine-tuning more sample-efficient), will we then have a really solid alignment plan?

    • Surely not. RLHF research won’t solve the whole problem on its own, because it doesn’t help with the oversight problems and catastrophe problems listed earlier.

    • That said, there’s some possibility that we end up with proposals for the oversight and catastrophe problems where the bottleneck on using them is that our RLHF isn’t sample-efficient enough. E.g., suppose that in order to align our model, we only needed thousands of labels and critiques from a crack team of our best labellers; we’d be in a much better situation than if we needed millions. And so in the year of AGI, I’d like it if RLHF were more sample-efficient.

    • But increasing the sample-efficiency of RLHF also might mean that AGI happens earlier, which might be bad.

  • Does it ever make sense to do empirical research on training schemes which you don’t think are a complete solution to the alignment problem?

    • Yes, I think it makes sense to do alignment research on training schemes that aren’t complete alignment solutions. I think most alignment research fits into one of the following two categories, both of which seem worth doing even though they aren’t complete solutions:

      • The topic of your research is not a complete alignment solution, but might fit into the alignment plan and make things marginally better.

      • The topic of your research is not a complete alignment solution, and won’t even fit helpfully into the current alignment plan, but it is a prerequisite for research into other alignment techniques that would contribute to the alignment plan.

  • Did it produce any value to write the first papers applying RLHF to deep models? (E.g., Christiano et al. 2017, Ziegler et al. 2020.)

    • IMO it produced some value. I think the case for this work is basically “we are eventually going to need to learn how to do RLHF, and we’d rather develop the early versions of these techniques early, because making RLHF happen somewhat earlier will probably mean that these techniques are somewhat more mature by the time AGI is developed”. I buy this argument a fair amount. I’m not sure what better empirical research these researchers could have done at the time.

    • See Ajeya’s comment here on the motivation of early RLHF research.

  • Has RLHF work so far caused massive net harm by making it easier to commercialize AI systems (because that increases AI investment, which hastens AGI, which reduces the amount of alignment research that will have happened by the time AGI arrives)?

    • I’m not sure. I think it’s quite plausible that RLHF work (most obviously the recent work making ChatGPT, but also earlier stuff) has caused a notable counterfactual speedup of AGI development, and this seems probably bad.

      That said, I’m less convinced than the median LessWrong commenter that speeding up AGI is bad right now, mostly because I partially agree with the argument in the first bullet point here.

  • Should we do more research on improving RLHF (e.g. increasing its sample efficiency, or understanding its empirical properties) now?

    • I think this research, though it’s not my favorite kind of alignment research, probably contributes positively to technical alignment. But it also plausibly boosts capabilities, so I think it’s better to do it privately (or at least not to promote your results extensively in the hope of raising more funding). On balance, I normally don’t recommend that people work on this, and I normally don’t recommend that projects of this type be funded.

  • Should we do research on alignment schemes which use RLHF as a building block? E.g. work on recursive oversight schemes or RLHF with adversarial training?

    • IMO, this kind of research is promising and I expect a large fraction of the best alignment research to look like this.

  • Is RLHF more dangerous than using prompt engineering or filtering or other methods to get a model’s outputs to be more useful to us?

    • RLHF is basically a way to take a pretrained language model (which just tries to predict what word would come next if it encountered this text on the internet) and get it to use all its knowledge to do more useful tasks instead of pure prediction. There are other ways you could try to get a language model to do more useful tasks: for example, you could carefully select its prompt so that its predictions are likely to be useful, or you could have it generate 100 outputs and use a different model to select the best one (a toy best-of-N sketch appears below).

    • I don’t think RLHF seems worse than these other ways you could try to get your model to use its capabilities to be helpful. (Many LessWrong commenters disagree with me here but I haven’t heard them give any arguments that I find compelling.)

    • I think that RLHF is reasonably likely to be safer than prompt engineering: RLHF is probably a more powerful technique for eliciting your model’s capabilities than prompt engineering is. And so if you need to make a system which has some particular level of performance, you can probably achieve that level of performance with a less generally capable model if you use RLHF than if you use prompt engineering.

    • At some point I hope to get around to arguing why research on e.g. safely using generative models seems pretty unpromising to me.
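
    • To make the comparison concrete, here’s a deliberately tiny sketch of the best-of-N approach mentioned above. The generate and score functions are hypothetical stand-ins for “sample a completion from the base model” and “a separate model’s rating of that completion”; no real model API is assumed. The important contrast with RLHF is that best-of-N never updates the base model’s weights; it just spends more inference-time compute per query.

      ```python
      # Toy sketch of best-of-N reranking as an alternative to RLHF for
      # eliciting useful outputs from a pretrained model. Both functions are
      # hypothetical stand-ins; no real model API is assumed.
      import random

      random.seed(0)

      def generate(prompt: str) -> str:
          # Stand-in for sampling one completion from a pretrained model.
          return random.choice([
              "a rambling half-answer",
              "a terse but correct answer",
              "a polished, genuinely useful answer",
          ])

      def score(prompt: str, completion: str) -> float:
          # Stand-in for a separate model's judgment of how useful the completion is.
          return {"a rambling half-answer": 0.1,
                  "a terse but correct answer": 0.6,
                  "a polished, genuinely useful answer": 0.9}[completion]

      def best_of_n(prompt: str, n: int = 100) -> str:
          # Generate n candidates and keep the one the scoring model likes best.
          # Unlike RLHF, the base model itself is never updated.
          candidates = [generate(prompt) for _ in range(n)]
          return max(candidates, key=lambda c: score(prompt, c))

      print(best_of_n("Summarize this post in one sentence."))
      ```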

  • I’m sad about RLHF because it makes it easier to train models which superficially look aligned, which will make it harder to build consensus that powerful models are dangerously misaligned.

    • Idk, I agree that labs will probably make superficially aligned models, and this will fool some researchers and policymakers. That seems bad. But I don’t buy that this consideration makes RLHF research look net negative: I think you will be able to make superficially aligned models with just a little bit of RLHF, so I don’t think that AI safety researchers boycotting RLHF research would make much of a difference here.