I’ve been thinking about Reinforcement Learning from Human Feedback (RLHF) a lot lately, mostly because of my AGISF capstone project, which attempts to use it to teach a language model to write better responses to Reddit writing prompts, à la “Learning to summarize from human feedback.”

RLHF has generated some impressive outputs lately, but there is significant disagreement about its potential as a partial or complete solution to alignment: some are excited to extend the promising results we have so far, while others are more pessimistic and perhaps even opposed to further work along these lines. I find myself optimistic about the usefulness of RLHF work, but far from confident that all of the method’s shortcomings can be overcome.

How it Works

At a high level, RLHF learns a reward model for a certain task from human feedback and then trains a policy to optimize the reward that model assigns. In practice, the learned reward model is likely to be overfit, so a policy that optimizes it too aggressively will end up exploiting its errors. The policy can thus benefit from interpolating between a policy that optimizes the reward model’s reward and a policy trained through pure imitation learning.
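To make the interpolation concrete, here’s a minimal sketch of one common way to do it: subtract a KL-style penalty from the reward model’s score, so that the policy is pulled back toward the imitation-learned policy. The function name, the `beta` coefficient, and the use of per-token log-probabilities are illustrative assumptions on my part, not a description of any particular implementation.

```python
from typing import Sequence


def kl_penalized_reward(
    rm_score: float,
    logp_policy: Sequence[float],
    logp_reference: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Reward-model score minus a penalty for drifting from the reference policy.

    logp_policy and logp_reference hold the per-token log-probabilities of one
    sampled completion under the policy being trained and under the
    imitation-learned reference policy (both hypothetical inputs).
    """
    if len(logp_policy) != len(logp_reference):
        raise ValueError("log-probability sequences must have the same length")
    # Single-sample estimate of the KL term: the summed per-token log-ratio.
    log_ratio = sum(p - q for p, q in zip(logp_policy, logp_reference))
    return rm_score - beta * log_ratio
```

With `beta = 0` the policy purely optimizes the reward model; as `beta` grows, the policy is kept closer to the imitation policy.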

A key advantage of RLHF is the ease of gathering feedback and the sample efficiency of training the reward model. For many tasks, it’s significantly easier to provide feedback on a model’s performance than to teach the model through imitation. We can also conceive of tasks that humans cannot complete themselves, but whose completions they can still evaluate and give feedback on. This feedback can be as simple as picking the better of two sample completions, but it’s plausible that other forms of feedback might be more appropriate and/or more effective than this. The ultimate goal is to get a reward model that represents human preferences for how a task should be done: this is the aim of Inverse Reinforcement Learning, which infers a reward function from observed behavior. The creators of the method, Andrew Ng and Stuart Russell, believe that “the reward function, rather than the policy, is the most succinct, robust, and transferable definition of the task.” Think about training an AI to drive a car: we might not want it to learn to imitate human drivers, but rather learn what humans value in driving behavior in the abstract and then optimize against those preferences.
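For the simplest kind of feedback, picking the better of two completions, a reward model can be trained with a pairwise comparison loss. The sketch below assumes a Bradley-Terry-style model, in which the probability that a human prefers one completion is a logistic function of the difference in reward. The function names are hypothetical, and the scalar rewards stand in for the outputs of a learned model.

```python
import math


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice under a logistic model.

    Equals -log(sigmoid(reward_preferred - reward_rejected)), written in a
    numerically stable form so large reward gaps do not overflow.
    """
    diff = reward_preferred - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Modeled probability that completion A is preferred over completion B."""
    return math.exp(-pairwise_loss(reward_a, reward_b))
```

The loss is small when the reward model already ranks the human-preferred completion higher, and grows roughly linearly with the gap when it ranks the pair the wrong way around.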

Outer Alignment Concerns

If a reward model trained through human feedback properly encoded human preferences, we might expect RLHF to be a plausible path to Outer Alignment. But this seems like a tall order, considering that humans can be assigned any values whatsoever, that the easy goal inference problem is still hard, and that it’s easy to misspecify any model that attempts to correct for human biases or irrationality. Ambitious value learning is hard, and I’m not particularly confident that RLHF makes it significantly more tractable.

It’s also plausible that this approach of inferring a reward function for a task is just fundamentally misguided and that the way to get an outer aligned system is through the assistance-game or CIRL framework instead. There are definite advantages of this paradigm over the more standard reward learning setup that RLHF leverages. By treating humans as pieces of the environment and the reward function as a latent variable in the environment, an AI system can merge the reward learning and policy training functions that RLHF separates and thereby “take into account the reward learning process when selecting actions.” This makes it easier to make plans conditional on future feedback, to gather feedback only as and when it becomes necessary, and to learn more fluidly from different forms of feedback.
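As a toy illustration of gathering feedback only when it becomes necessary, the sketch below compares acting now against asking first. It assumes a finite set of candidate reward functions over the same actions, a belief over which one is true, and a human whose answer reveals the true one at a fixed cost. These are simplifying assumptions of mine, and all names are hypothetical.

```python
from typing import Dict, List

Reward = Dict[str, float]  # maps each available action to its reward


def value_of_asking(belief: List[float], hypotheses: List[Reward]) -> float:
    """Expected gain from learning the true reward before choosing an action."""
    actions = list(hypotheses[0])
    # Best expected reward if we must commit to one action under uncertainty.
    best_without_feedback = max(
        sum(p * h[a] for p, h in zip(belief, hypotheses)) for a in actions
    )
    # Expected reward if we first learn which hypothesis is true, then act.
    best_with_feedback = sum(
        p * max(h.values()) for p, h in zip(belief, hypotheses)
    )
    return best_with_feedback - best_without_feedback


def should_ask(belief: List[float], hypotheses: List[Reward], query_cost: float) -> bool:
    """Ask the human only when the expected gain exceeds the cost of asking."""
    return value_of_asking(belief, hypotheses) > query_cost
```

When the belief is split between hypotheses that recommend different actions, asking is worth a lot; once the belief is concentrated on one hypothesis, the value of asking drops toward zero and the system simply acts.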

Scalable oversight is hard

RLHF also relies upon humans being able to evaluate the outputs of models. This will likely be impossible for the kinds of tasks we want to scale AI to perform: it’s just going to be too hard for a human to understand why one output should be preferred over another. We’d simply have to hope that the reward model generalization we’d seen previously, when oversight was still possible, continued to hold. Even if we thought we’d figured out how to evaluate our models’ outputs, there’s always the chance of an inner alignment failure or other deceptive behavior evading our oversight, so we’d want to be absolutely certain that our reward and policy models were actually doing what we wanted them to do.

The solutions to the scalable oversight problem seem to primarily rely on AI-assistance and/or breakthroughs in interpretability techniques. I think it’s clear how the latter might be useful: if we could just look at any model and be certain of its optimization objective, we’d probably feel pretty comfortable understanding the reward models and policy models we trained. AI-assistance might look something like recursive reward modeling: break the task that’s too hard to oversee into more manageable chunks that a human can oversee, and train models to perform those subtasks. Using the models trained on the narrower subtasks might then make the original task possible to oversee: this is an idea that has been used for the task of summarizing books. It’s plausible that there are many tasks that resist this kind of decomposition, but the factored cognition approach might get us very far indeed.

Why I think RLHF is valuable

I’ll quote Paul Christiano here:

We are moving rapidly from a world where people deploy manifestly unaligned models (where even talking about alignment barely makes sense) to people deploying models which are misaligned because (i) humans make mistakes in evaluation, (ii) there are high-stakes decisions so we can’t rely on average-case performance.

This seems like a good thing to do if you want to move on to research addressing the problems in RLHF: (i) improving the quality of the evaluations (e.g. by using AI assistance), and (ii) handling high-stakes objective misgeneralization (e.g. by adversarial training).

In addition to “doing the basic thing before the more complicated thing intended to address its failures,” it’s also the case that RLHF is a building block in the more complicated things.

I think that (a) there is a good chance that these boring approaches will work well enough to buy (a significant amount) time for humans or superhuman AIs to make progress on alignment research or coordination, (b) when they fail, there is a good chance that their failures can be productively studied and addressed.

I generally agree with this. Solutions to the problems that crop up in RLHF seem likely to transfer to other alignment methods, and even the attempts that fail seem likely to be productive mistakes. The interpretability techniques we develop, outer or inner alignment failures we find, and latent knowledge we elicit from our reward and policy models all seem broadly applicable to future AI paradigms. In other words, I think the textbook from the future on AI Alignment is likely to speak positively of RLHF, at the very least as an early alignment approach.

Promising RLHF Research Directions (according to me)

I’d like to see different kinds of feedback used in addition to preference orderings over model outputs. This paper specifies a formalism for reward learning in general and considers several different kinds of feedback that might be appropriate for different tasks, e.g. demonstration, correction, natural language feedback, etc. A reward model that can gracefully learn from a wide array of feedback types seems like a desirable goal. This kind of exploration might also help us figure out better and worse forms of feedback and what kinds of generalization arise from each type.
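One way to treat several feedback types uniformly is to model each piece of feedback as a noisy choice from a set of options: a comparison is a choice between two completions, a correction is a choice of the corrected output over the original, and so on. The sketch below is my own simplified illustration of that idea rather than the formalism from the paper linked above. It assumes a finite set of candidate reward functions and a Boltzmann-rational human, and every name in it is hypothetical.

```python
import math
from typing import Callable, List, Sequence

RewardFn = Callable[[str], float]


def update_belief(
    belief: Sequence[float],
    hypotheses: Sequence[RewardFn],
    options: Sequence[str],
    chosen: str,
    rationality: float = 1.0,
) -> List[float]:
    """Bayesian update over reward hypotheses after one observed choice.

    The human is assumed to pick `chosen` from `options` with probability
    proportional to exp(rationality * reward(option)).
    """
    if chosen not in options:
        raise ValueError("chosen must be one of the options")
    unnormalized = []
    for prior, reward in zip(belief, hypotheses):
        scores = [rationality * reward(o) for o in options]
        top = max(scores)
        # Log of the normalizing constant, computed stably.
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        likelihood = math.exp(rationality * reward(chosen) - log_norm)
        unnormalized.append(prior * likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

Setting `rationality` to zero models a human who chooses at random, in which case the feedback carries no information and the belief is left unchanged.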

Relatedly, I think it might be interesting to see how the assistance game paradigm performs in settings where the RLHF paradigm has been applied, like text summarization. On a theoretical level it seems clear that the assistance game setup offers some unique benefits and it would be cool to see those realized.

As we continue to scale RLHF work up, I want to see how we begin to decompose tasks so that we can apply methods like Recursive Reward Modeling. For book summarization, OpenAI used a fixed chunking algorithm to break the text down into manageable pieces, but it seems likely that other kinds of decomposition won’t be as trivial. We might need AI assistance to decompose tasks that we can’t oversee into tasks that we can. Training decomposition models that can look at a task and identify overseeable subtasks seems like a shovel-ready problem, perhaps one that we might even apply RLHF to.
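Here’s a rough sketch of what decomposition by fixed chunking might look like. It is not OpenAI’s actual algorithm: the chunk size measured in words, the `summarize` callable, and the stand-in summarizer are all hypothetical choices of mine, meant only to show the shape of the recursion.

```python
from typing import Callable, List


def fixed_chunks(words: List[str], chunk_size: int) -> List[List[str]]:
    """Split a list of words into consecutive pieces of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def recursive_summarize(
    text: str, summarize: Callable[[str], str], chunk_size: int = 200
) -> str:
    """Summarize chunk by chunk, then summarize the summaries, until one chunk remains."""
    words = text.split()
    while len(words) > chunk_size:
        pieces = [summarize(" ".join(c)) for c in fixed_chunks(words, chunk_size)]
        shorter = " ".join(pieces).split()
        if len(shorter) >= len(words):
            # Guard against a summarizer that never shrinks its input.
            raise ValueError("summarize() must shorten its input")
        words = shorter
    return summarize(" ".join(words))


def first_words(passage: str, keep: int = 3) -> str:
    """Stand-in for a learned summarizer: keep only the first few words."""
    return " ".join(passage.split()[:keep])
```

Each call to `summarize` sees at most `chunk_size` words, which is the property that keeps every individual step small enough for a human to check. A learned decomposition model would replace `fixed_chunks` with something that picks subtasks based on the content of the task.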