Thoughts on the impact of RLHF research

In this post I’m going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I’ll discuss various arguments that RLHF research had an overall negative impact and explain why I don’t find them persuasive.

I’ll also clarify that I don’t think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between “RLHF progress” and “alignment progress.”

Background on my involvement in RLHF work

Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background:

  • The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model’s actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. (This is in contrast with, for example, trying to formally specify the human utility function, or notions of corrigibility / low-impact / etc., in some way.) A minimal toy sketch of this evaluate-then-train structure appears just after this list.

  • Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because:

    • Evaluating consequences is hard.

    • A treacherous turn can cause trouble too quickly to detect or correct, even in cases where you would otherwise be able to do so, and it’s challenging to evaluate the probability of a treacherous turn at training time.

  • It’s very unclear whether those issues become fatal before or after AI systems are powerful enough to completely transform human society (and in particular the state of AI alignment). Even if they are fatal, many of the approaches to resolving them still have the same basic structure of learning from expensive evaluations of actions.
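
As a concrete illustration of the evaluate-then-train structure in the first bullet above, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration rather than a description of any system discussed in this post: the “overseer” is a synthetic linear preference rather than a human, the reward model is linear, and the “policy improvement” step is a search over random candidate actions rather than an RL fine-tuning run.

```python
# Toy sketch (illustrative only): an overseer compares pairs of actions, we fit a
# reward model to those comparisons, and we then optimize a "policy" against the
# learned reward. The synthetic overseer, linear models, and all constants below
# are assumptions made for this illustration.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8                           # toy feature dimension for "actions"
true_pref = rng.normal(size=DIM)  # stand-in for what the overseer actually values

def overseer_prefers(a, b):
    """Stand-in for an expensive human judgment: is action a better than action b?"""
    return float(true_pref @ a > true_pref @ b)

# 1. Collect pairwise comparisons (the expensive step in real RLHF).
pairs = [(rng.normal(size=DIM), rng.normal(size=DIM)) for _ in range(500)]
labels = np.array([overseer_prefers(a, b) for a, b in pairs])

# 2. Fit a reward model to the comparisons (Bradley-Terry / logistic loss,
#    plain gradient descent on a linear reward function).
w = np.zeros(DIM)
for _ in range(500):
    grad = np.zeros(DIM)
    for (a, b), y in zip(pairs, labels):
        p = 1.0 / (1.0 + np.exp(-(w @ (a - b))))  # model's P(a preferred over b)
        grad += (p - y) * (a - b)
    w -= 0.1 * grad / len(pairs)

# 3. "Policy improvement": pick the action the learned reward model rates highest.
#    (In real RLHF this step is an RL fine-tuning run, not a search over candidates.)
candidates = rng.normal(size=(1000, DIM))
best_action = candidates[np.argmax(candidates @ w)]

# Sanity check: how often does the learned reward agree with the overseer on new pairs?
test_pairs = [(rng.normal(size=DIM), rng.normal(size=DIM)) for _ in range(200)]
agreement = np.mean([(w @ (a - b) > 0) == (true_pref @ (a - b) > 0) for a, b in test_pairs])
print(f"held-out agreement with overseer: {agreement:.2f}")
print(f"learned-reward score of chosen action: {best_action @ w:.2f}")
```

The sketch is only meant to show the overall shape of the approach: expensive pairwise evaluations go in, a learned reward model comes out, and a policy is then optimized against that learned reward. In real RLHF systems the actions are model outputs, the comparisons come from human raters, and step 3 is a reinforcement learning run against the learned reward model.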

In order to overcome the fundamental difficulties with RLHF, I have long been interested in techniques like iterated amplification and adversarial training. However, prior to 2017 most researchers I talked to in ML (and many researchers in alignment) thought that the basic strategy of training AI with expensive human evaluations was impractical for more boring reasons, and so weren’t interested in these difficulties. On top of that, we obviously weren’t able to implement anything fancier than RLHF, since all of those fancier methods also involve learning from expensive feedback. I worked on RLHF to try to facilitate and motivate work on fixes.

The history of my involvement:

  • My first post on this topic was in 2015.

  • When I started full-time at OpenAI in 2017, RLHF seemed to me like it would be an impactful project; I considered doing a version with synthetic human feedback (showing that we could learn from a practical amount of algorithmically defined feedback), but my manager Dario Amodei convinced me it would be more compelling to immediately go for human feedback. The initial project was surprisingly successful and published here.

  • I then intended to implement a version with language models, aiming to be done in the first half of 2018 (and to build an initial amplification prototype with LMs around the end of 2018; both of these timelines were about 2.5x too optimistic). Language models seemed like the most important domain in which to study RLHF and alignment more broadly. In mid-2017 Alec Radford helped me do a prototype with LSTM language models (prior to the release of transformers); the prototype didn’t look promising enough to scale up.

  • In mid-2017 Geoffrey Irving joined OpenAI and was excited about starting with RLHF and then going beyond it using debate; he also thought language models were the most important domain to study and had more conviction about that. In 2018 he started a larger team working on fine-tuning language models, which completed its initial RLHF project in 2019. This required building significant infrastructure for scaling and working with language models, since this work was happening in parallel with GPT-2.

  • Geoffrey later left for DeepMind and I took over the team. We wrote a follow-up paper polishing the result to the point where it seemed to be production-ready. Some people on the team started working on applying these results in production; Ryan Lowe ultimately led this effort, which spun out into a different team (see paper). We also began working on simple settings where humans needed to use AI systems to solve subtasks (see paper). I left OpenAI at the start of 2021 to return to focusing on theory, and Jan Leike took over the team.

The case for a positive impact

Overall, I think that early work on RLHF had significant value:

  • I think it is hard to productively work on more challenging alignment problems without first implementing basic solutions.

    • “Solve real problems one at a time” seems like a good way to make progress and is how most fields work. Trying to justify research on problem X by saying “well we could do RLHF, but it wouldn’t fix speculative problem X” is uncompelling to most audiences if no one has implemented RLHF or observed problem X. It’s even worse if they have plenty of more mundane examples of unaligned behavior unrelated to X.

    • Without implementing basic solutions it’s much harder to empirically validate your hypotheses about risks. We can make reasonable arguments about what failures will eventually occur with RLHF, but you can learn more by building the system and studying it. I think there are real, huge uncertainties here, and the safety community is taking weak arguments too seriously.

    • A lot of historical work on alignment seems like it addresses subsets of the problems solved by RLHF, but doesn’t actually address the important ways in which RLHF fails. In particular, a lot of that work is only necessary if RLHF is prohibitively sample-inefficient. Determining whether RLHF has fundamental difficulties seems like a good way to improve research prioritization.

  • Many more complex alignment proposals involve the same technical ingredients as RLHF, especially learning a reward from an expensive overseer. I think that debate and recursive reward modeling in particular are plausible approaches to alignment for mildly superhuman systems, and they build directly on RLHF.

  • Taking ideas from theory to practice helps build expertise about how to do so, which both informs alignment research and facilitates future implementation.

    • For example, a major point of disagreement between me and Eliezer is that Eliezer often dismisses plans as “too complicated to work in practice,” but that dismissal seems divorced from experience with getting things to work in practice (e.g. some of the ideas that Eliezer dismisses are not much more complex than RLHF with AI assistants helping human raters). In fact I think that you can implement complex things by taking small steps—almost all of these implementation difficulties do improve with empirical feedback.

    • Moreover, this kind of expertise is directly relevant when implementing future alignment proposals even if they are very different from RLHF. The implicit alternative seems to be an alignment community that deliberately avoids any problems that would be helpful for making AI systems useful, and potentially avoids doing any engineering work at all, creating predictable and potentially huge problems with implementation.

The case for a negative impact

People in the safety community make some arguments that research on RLHF has costs larger than these benefits. I don’t currently find these arguments persuasive:

  • RLHF (and other forms of short-term “alignment” progress) make AI systems more useful and profitable, hastening progress towards dangerous capabilities.

    • RLHF is just not that important to the bottom line right now.[1] Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems, and the whole issue is mostly second order for the current bottom line. RLHF is increasingly important as time goes on, but it also becomes increasingly overdetermined that people would have done it. In general I think your expectation should be that incidental capabilities progress from safety research is a small part of total progress, given that it’s a small fraction of people, very much not focused on accelerating things effectively, in a domain with diminishing returns to simultaneous human effort. This can be overturned by looking at details in particular cases, but I think safety people making this argument mostly aren’t engaging with details in a realistic way.

    • Trying to delay AI progress by avoiding making AI systems better at doing what people want feels holistically unwise. RLHF does not appear to increase the kind of capabilities that are directly relevant to risk, but instead has an indirect effect via making AI systems more useful. My intuitive reaction is similar to a proposal to lobby against improvements to the tax code so that taxes will be more painful and the public will be more opposed to new taxes. It might be OK if your goal is to reduce tax burden, but probably counterproductive for reducing the social cost of taxes.

    • Avoiding RLHF at best introduces an important overhang: people will implicitly underestimate the capabilities of AI systems for longer, slowing progress now but leading to faster and more abrupt change later as people realize they’ve been wrong. Similarly, to the extent you successfully slow scaling, you are then in for faster scaling later from a lower initial amount of spending—I think it’s significantly better to have a world where TAI training runs cost $10 billion than a world where they cost $1 billion. A key background view is that the great majority of effective safety work will come when people are working with systems that are much closer to posing a risk, e.g. so they can actually exhibit and study interesting forms of reward hacking and deceptive alignment. Overall in expectation I think these effects claw back most of the benefits of slowing down progress by avoiding RLHF.

  • RLHF “covers up problems” so that you can’t or won’t fix them in other ways.

    • RLHF lets you produce models that don’t do bad-looking things, but there are some things which look fine but are actually bad. So you might worry that RLHF makes problems harder to study by covering up their symptoms. But we can (and do) still train models without RLHF, or using a weak overseer where outputs can be validated by stronger overseers. It seems that RLHF makes it much easier to produce realistic examples of problems—both because it facilitates settings with the kind of realistic failure modes you actually want to study (namely overpowering or misleading overseers) and because without RLHF there are going to be a thousand other hacks to try first to fix the problems.

    • You might argue that RLHF gives people a way to cover up problems and so lets them avoid fixing them in deeper ways, or gives them a “false sense of security.” But in practice if people run into problems that can be fixed with RLHF, it looks like they will just do RLHF later (which is getting easier and easier over time). And in practice most of the problems that can be addressed with RLHF can be addressed in other hackier ways as well. This potential objection seems to rest on an unreasonably optimistic model about how superficial problems force people into pursuing deep fixes.

  • RLHF is less safe than imitation or conditioning generative models.

    • If we’re considering the danger posed by a model of a fixed level of usefulness, I think this is probably false, though it’s a complicated question and I’m uncertain. The AI safety community makes various informal arguments about this which I find unpersuasive (though I mostly haven’t seen them laid out carefully). I suspect the differences are small and require empirical investigation. (While I appreciate many of the investigations in this paper and think it is good to improve our understanding, I don’t think they let us tell what’s up with risk.) This could be the subject of a much longer post and maybe will be discussed in the comments.

    • If RLHF poses distinctive risks, we are overwhelmingly more likely to avoid those risks by understanding them rather than by hoping no one ever implements RLHF. It’s unrealistic and deeply unstable to hope that no one uses RLHF because they didn’t think of it.

  • This entire alignment approach is impractical, and therefore all the arguments about “taking the first step in the right direction” are wrong. On top of that, working on RLHF obfuscates that fact and dilutes what should be a robust community consensus.

    • To the extent this is true, I think it would be a pretty powerful argument against RLHF (largely because it implies that most of the benefits aren’t real). But I don’t agree that the approach can’t work. I’ve talked about this a lot with people, but feel like the arguments just aren’t holding together. The main weak links are (i) arguments about the timing of difficulties relative to e.g. radically superhuman models (almost all of the arguments kick in after human level, and it’s just not clear how far after), (ii) the probability of deceptive alignment emerging despite simple countermeasures, which I think of as a completely open empirical question (existing arguments are fine for arguing plausibility, but definitely can’t get you to 90% rather than 50%), and (iii) the feasibility of fundamental improvements to RLHF.

Overall, I think it was valuable to use RLHF to fix the kind of basic alignment problems that are ubiquitous with pre-trained models. I think it has had a real impact facilitating work on more fundamental challenges, and helped move the community one step closer towards the kind of alignment solutions I expect to ultimately be successful.

Future work

I remain excited about “straightforward” approaches to improving RLHF, like devising better feedback (using combinations of human and AI work) and improving robustness via adversarial training. I think this work will continue to make ML systems more useful in practice, and so will be subject to the same kinds of objections as above. I still tentatively think this work is net positive and don’t find the arguments against it persuasive.

I think this follow-up research will also not need to solve the “fundamentally confusing” problems for a long time, but that solving tractable problems gives you a good chance of aligning modestly superhuman AI and facilitates future work on the remaining more challenging problems.

That said, I don’t think that improving or studying RLHF is automatically “alignment” or necessarily net positive. Research should be justified by an argument that it actually helps address important failures. Here are some types of work in this space that I’m particularly excited about:

  • Work that addresses robustness in cases where we cannot train on deployment examples, or where we care about failure rates that are small relative to fine-tuning dataset size. In practice this would happen if failures are very high-stakes, but we can also study synthetic domains where we artificially aim at very low failure rates.

  • Training AI systems to give more correct answers in domains where human overseers can’t easily judge results and there is no other source of end-to-end feedback during training. That may involve giving humans better tools, studying and improving generalization from domains that do have feedback, or other methods.

  • Anything that addresses clear examples of alignment failures, for which we have good reasons to believe that models “know” things they aren’t telling us, or “know” what we want them to do but nevertheless do something else. Many of these will fall into the first two categories, but it’s also interesting to fix more mundane failures (e.g. obvious untruths) if they can be clearly identified as alignment problems.

  • Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.

  1. ^

    I would wildly guess that my involvement in RLHF and early language model training at OpenAI from 2017-2020 put me in the top 100 people accelerating AI progress but not in the top 10; I’d wildly guess that I accelerated progress by a few tenths of a percent during this period, and perhaps cut down timelines to powerful AI by a few days. I think there’s room for debate one way or the other on that.
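
    (One way to read those two wild guesses together, purely as an illustrative arithmetic check: a few tenths of a percent of the roughly 3.5 years of progress made between 2017 and 2020 works out to around 0.003 × 3.5 × 365 ≈ 4 days, which is how the “few tenths of a percent” and the “few days” figures line up.)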

    In some sense this is a big acceleration and it’s wrong to write it off as “not that important.” But I think accelerating a ChatGPT-style wakeup by a week is not a major cost (in addition to being plausibly positive, there just wasn’t that much AI-risk-reducing activity happening per week in the world of 2018).

    I also continue to think that RLHF is great, but that people overestimate (and misunderstand in all kinds of wild directions) the practical impact that it actually has on system behavior relative to the counterfactual training techniques.

    (I added this footnote long after the post was written, reacting to different people interpreting the post in very different ways, e.g. Oliver’s comments below and Michael Nielsen’s here.)