# Compendium of problems with RLHF

Epistemic status: This post is a distillation of many comments/posts. I believe that my list of problems is not the best organization of sub-problems. I would like to make it shorter and simpler, because cool theories are generally simple, unified theories, by identifying only 2 or 3 main problems without lumping together problems that have different gears-level mechanisms, but currently I am too confused to be able to do so. Note that this post is not intended to address the potential negative impact of RLHF research on the world, but rather to identify the key technical gaps that need to be addressed for an effective alignment solution. Many thanks to Walter Laurito, Fabien Roger, Ben Hayum, and Justis Mills for useful feedback.

RLHF tldr: We need a reward function, we cannot hand-write it, let’s make the AI learn it!
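One way to make "let the AI learn it" concrete is the preference-based setup of Deep Reinforcement Learning From Human Preferences (Christiano et al, 2017), cited later in this post: a rater compares two behaviors, and the reward model is fit so that the preferred one scores higher. Below is a minimal sketch, assuming a Bradley-Terry-style preference probability and a toy one-parameter reward model. The feature values, the function names, and the learning rate are illustrative assumptions, not anything taken from the paper's code.

```python
import math

def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry-style probability that a rater prefers A over B."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))

def preference_loss(reward_a: float, reward_b: float, human_prefers_a: bool) -> float:
    """Cross-entropy between the predicted preference and the human label."""
    p_a = preference_probability(reward_a, reward_b)
    return -math.log(p_a if human_prefers_a else 1.0 - p_a)

def fit_reward_weight(comparisons, lr=0.5, steps=200):
    """Fit a toy reward model r(x) = w * x by gradient descent on the preference loss."""
    w = 0.0
    for _ in range(steps):
        for feat_a, feat_b, human_prefers_a in comparisons:
            p_a = preference_probability(w * feat_a, w * feat_b)
            label = 1.0 if human_prefers_a else 0.0
            # Gradient of the loss w.r.t. w is -(label - p_a) * (feat_a - feat_b).
            w += lr * (label - p_a) * (feat_a - feat_b)
    return w

# Hypothetical data: one scalar feature per clip (say, "fraction of a backflip completed"),
# and a human label saying which clip of each pair looked better.
comparisons = [(0.9, 0.2, True), (0.1, 0.7, False), (0.8, 0.3, True)]
learned_weight = fit_reward_weight(comparisons)
```

The learned reward is only as good as the comparisons it was fit on, which is why most of the problems listed below concern either the human labels or what happens when a policy optimizes hard against this learned proxy.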

Problem 0: RLHF is confusing. Human judgment and feedback are so brittle that even junior alignment researchers like me thought that RLHF was a not-too-bad solution to the outer alignment problem. I think RLHF confuses a lot of people and distracts them from the core issues. Here is my attempt to become less confused.

## Why does RLHF count as progress?

• Without RLHF, approximating the reward function for a “good” backflip was tedious and almost impossible. With RLHF, it is possible to obtain a reward function that, when optimized by an RL policy, leads to beautiful backflips.

• Value is complex? Value is fragile? But being polite is complex, being polite is fragile, and we can implement this roughly in ChatGPT[1].

• “RLHF withstands more optimization pressure than supervised fine-tuning: By using RLHF you can obtain reward functions that withstand more optimization pressure than traditionally hand-crafted reward functions. Insofar as that’s a key metric we care about, it counts as progress.” [Richard’s comment]

## Why is RLHF insufficient?

Buck highlights two main types of problems with using RLHF to create an AGI: oversight issues and the potential for catastrophic outcomes. This partition of the problem space comes from the post Low-Stakes alignment. Although it may be feasible to categorize all problems under this partition, I believe that this categorization is not granular enough and lumps different types of problems together.

Davidad suggests dividing the use of RLHF into two categories: “Vanilla-RLHF-Plan” and “Broadly-Construed-RLHF” as part of the alignment plan. The “Vanilla-RLHF-plan” refers to the narrow use of RLHF (say as used in ChatGPT) and is not sufficient for alignment, while “Broadly-Construed-RLHF” refers to the more general use of RLHF as a building block and is potentially useful for alignment. The list below outlines the problems with “Vanilla-RLHF” that “Broadly-Construed-RLHF” should aim to address.

### Existing problems with RLHF because of (currently) non-robust ML systems

Those issues seem to be mostly a result of poor capabilities, and I think those problems may disappear as models grow larger.

1. Benign Failures: ChatGPT can fail. And it is not clear if this problem will disappear as capabilities scale.

2. Mode Collapse, i.e. a drastic bias toward particular completions and patterns. Mode collapse is expected when doing RL.

3. You need regularization. Your model can severely diverge from the model you would have gotten if you had gotten feedback in real time from real humans. You need to use a KL divergence penalty and choose the regularization constant. Choosing this constant feels arbitrary. Without this KL divergence penalty, you get mode collapse (see the annex here).
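To see why the regularization constant matters, consider a toy version of the KL-penalized objective. Over a fixed, finite set of completions, maximizing expected reward minus beta times KL(policy || reference) has the closed-form solution policy(y) ∝ reference(y) · exp(reward(y) / beta). The sketch below evaluates that closed form for a large and a small beta. The four-completion distribution and the reward values are made-up assumptions, and a real RLHF run only approximates this optimum with a policy-gradient method.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer of E[reward] - beta * KL(policy || reference)."""
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical reference (base model) distribution over four completions,
# and hypothetical reward-model scores for each of them.
ref_probs = [0.4, 0.3, 0.2, 0.1]
rewards = [0.0, 0.5, 1.0, 0.8]

near_reference = kl_regularized_optimum(ref_probs, rewards, beta=100.0)
collapsed = kl_regularized_optimum(ref_probs, rewards, beta=0.01)

for name, policy in (("beta=100", near_reference), ("beta=0.01", collapsed)):
    print(name, [round(p, 3) for p in policy], "KL to reference:", round(kl_divergence(policy, ref_probs), 3))
```

With a large beta the policy stays close to the reference model; with a small beta nearly all probability mass lands on the single highest-reward completion, which is the toy analogue of mode collapse. Nothing in the objective itself says where between these extremes the constant should sit.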

My opinion: The benign failures seem to be much more problematic than mode collapse and the need for regularization. Currently, prompt injections seem to be able to bypass most security measures. And we can’t even count on respectability incentives to push OpenAI and other companies to deploy more robust solutions. For example, the public was not happy with the fact that the AI kept repeating “I am an AI developed by OpenAI”, which pushed OpenAI to release the January 9 version that is again much more hackable than the December 15 patch version (benchmark coming soon). So this is one of the problems that seems to me the most severe. But it may be possible to reduce a lot of this kind of failure with more focus on engineering and more capable systems. So I rather predict that this kind of error will disappear in the future, but not with high confidence.

### Incentives issues of the RL part of RLHF

I expect those problems to be more salient in the future, as it seems to be suggested by recent work done at Anthropic.

4. RL makes the system more goal-directed and less symmetric than the base model. For example, we can tell Paul Christiano’s story of the model which was over-optimized by OpenAI. This model was trained using a positive sentiment reward system and ended up determining that wedding parties were the most positive subject. Whenever the model was given a prompt, it would generate text that described a wedding party. This breaks the symmetry: Fine-tuning a large sequence model with RLHF shapes a model that steers the sequence in rewarding directions. The model has been shaped to maximize its reward by any means necessary[2], even if it means suddenly delivering an invitation to a wedding party. This is weak evidence towards the “playing the training game” scenario.

5. Instrumental convergence: Larger RLHF models seem to exhibit harmful self-preservation preferences, and *sycophancy*: insincere agreement with user’s sensibilities. [paper, short summary]

6. Incentivizes deception: “RLHF/IDA/debate all incentivize promoting claims based on what the human finds most convincing and palatable, rather than on what’s true. RLHF does whatever it has learned makes you hit the “approve” button, even if that means deceiving you.” [from Steiner]. See also the robotic hand in Deep Reinforcement Learning From Human Preferences (Christiano et al, 2017) and comments on how this would scale.

7. RL could make thoughts opaque. See By Default, GPTs Think In Plain Sight: “GPTs’ next-token-prediction process roughly matches System 1 (aka human intuition) and is not easily accessible, but GPTs can also exhibit more complicated behavior through chains of thought, which roughly matches System 2 (aka human conscious thinking process). Humans will be able to understand how human-level GPTs (trained to do next-token-prediction) complete complicated tasks by reading the chains of thought. GPTs trained with RLHF will bypass this supervision”.

8. Capabilities externalities: RL already increases the capabilities of base models, and RL could be applied to further increase capability by a lot, and could go much further than imitation. This has already burnt timelines, and will continue to do so.

My opinion:

• I’m not able to tell if point 4 is a big deal or not. I will be more concerned if the attractors that are created as a result of RLHF are of a bad nature. But as it is, it seems that the attractors that are created as a result of the RLHF are just emanations of repetition or diversity problems. To change my mind, you would have to show me an attractor that is bad in nature, and was still strengthened during the RL. Also, the models that result from RL shouldn’t be more directed than rejection sampling.

• 5 and 6 seem to be potential issues.

• The “Opaque thoughts” problem is more speculative, and I think this will be a problem only on advanced systems using scratchpads and long-term memory.

• 8 feels important, even if it is not a central point.
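The comparison with rejection sampling in the first bullet above can be made concrete with a best-of-n sketch: draw n completions from the base model and keep the one the reward model scores highest. The toy below reuses the wedding-party story from point 4. The topic probabilities and sentiment scores are hypothetical numbers chosen for illustration, not measurements from any model.

```python
import random

# Hypothetical base-model topic distribution and sentiment rewards (illustrative only).
BASE_TOPICS = {"weather": 0.50, "work": 0.30, "holiday": 0.15, "wedding party": 0.05}
SENTIMENT_REWARD = {"weather": 0.1, "work": 0.0, "holiday": 0.6, "wedding party": 1.0}

def sample_topic(rng: random.Random) -> str:
    """Sample one completion topic from the toy base model."""
    topics = list(BASE_TOPICS)
    weights = list(BASE_TOPICS.values())
    return rng.choices(topics, weights=weights, k=1)[0]

def best_of_n(n: int, rng: random.Random) -> str:
    """Rejection sampling: draw n candidates and keep the highest-reward one."""
    candidates = [sample_topic(rng) for _ in range(n)]
    return max(candidates, key=SENTIMENT_REWARD.__getitem__)

def wedding_rate(n: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of best-of-n outputs that are about a wedding party."""
    rng = random.Random(seed)
    hits = sum(best_of_n(n, rng) == "wedding party" for _ in range(trials))
    return hits / trials

rates = {n: wedding_rate(n) for n in (1, 4, 16, 64)}
print(rates)
```

In this toy, the best-of-n output is a wedding party whenever at least one of the n samples is, so the rate is 1 - 0.95^n: a topic the base model produces 5% of the time dominates once n is large. Selection pressure alone, with no gradient updates, is enough to create such an attractor here, so the wedding-party behavior by itself does not distinguish RL from heavy rejection sampling.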

9. RLHF requires a lot of human feedback, and still exhibits failures. To align ChatGPT, I estimated that creating the dataset cost \$1M, which was roughly the same price as the training of GPT-3[3]. [details] And \$1M is still not sufficient to solve the problem of benign failures currently, and much more effort could be needed to solve those problems completely. As systems become more advanced, they may require more complex data, and the cost of obtaining the necessary data may become prohibitive. Additionally, as we scale compute and push systems beyond human capabilities, we may encounter limitations in the number of qualified annotators available.

10. “Human operators are fallible, breakable, and manipulable. Human raters make systematic errors—regular, compactly describable, predictable errors.” [LOL20]. And this remains true even if you recruit humans on Bountied Rationality and pay them a reasonable \$50 an hour, as OpenAI did… Several categories of people have been recruited by OpenAI, and Bountied Rationality seems to be the upper tier, according to Time: “To get those labels, OpenAI sent tens of thousands of snippets of text to an outsourcing firm in Kenya, beginning in November 2021. Much of that text appeared to have been pulled from the darkest recesses of the internet. Some of it described situations in graphic detail like child sexual abuse, bestiality, murder, suicide, torture, self harm, and incest.” Aligning an AI with RLHF requires going through unaligned territory, and I can’t imagine what the unaligned territory will look like in the future. This will probably cause psychological harm to annotators.

11. You are using a proxy, not human feedback directly: The reward model is a proxy trained on human feedback that represents what humans want; you then use this model to give reward to a policy[4]. That’s less reliable than having an actual human give the feedback directly to the model.

12. How to scale human oversight? Most of the challenge of RLHF in practice is in getting humans to reward the right thing, and in doing so at a sufficient scale. [more here]

My opinion:

• A solution to point 12 would be Scalable Oversight, Model assistance. See the critiques paper as a proof of concept. Counter solution: Alignment tax because Agent AIs will be more intelligent than Tool AIs, in addition to their greater economic competitiveness.

• Point 9 is not currently a significant problem because the cost of 1M for OpenAI is relatively low.

• Using RL(AI)F may offer a solution to all the points in this section: By starting with a set of established principles, AI can generate and revise a large number of prompts, selecting the best answers through a chain-of-thought process that adheres to these principles. Then, a reward model can be trained and the process can continue as in RLHF. This approach is potentially better than RLHF as it does not require human feedback.

### Superficial Outer Alignment

13. Superficially aligned agents: Bad agents are still there beneath. The capability is still here[5].

14. The system is not aligned at all at the beginning of the training and has to be pushed in dangerous directions to train it. For more powerful systems, the beginning of the training could be the most dangerous period.

15. RLHF is not a specification, only a process. Even if you don’t buy the outer/inner decomposition, RLHF is a more limited way of specifying intended behavior+cognition than what we would want, ideally. (Your ability to specify what you want depends on the behavior you can coax out of the model.) RLHF is just a fancy word for preference learning, leaving almost the whole process of what reward the AI actually gets undefined. [source]. Comment: No one has been able to obtain an exhaustive specification of human values, but maybe a constitution is sufficient? (We must then assume that the constitution sufficiently represents our values.)

16. If you have the weights of the model, it is possible to misalign it by fine-tuning it again in another direction, i.e. fine-tuning it so that it mimics Hitler or something else. This seems very inexpensive to do in terms of computation. If you believe in the orthogonality thesis, you could fine-tune your model toward any goal.

My opinion:

• Points 13, 15, and 16 appear to be low-priority problems, but they may indicate new paradigmatic ways of solving the issues at hand.

• 16 seems a bit unrealistic and is not a central point.

The following points in this post are not specifically points against RLHF: today, no technique applied to language models solves the strawberry problem or the generalization problem. But it is useful to recall these problems.

### The Strawberry problem

RLHF does not a priori solve the strawberry problem. Nate Soares sums up the problem as:

17. Pointer problem: Directing a capable AGI towards an objective of your choosing.

18. Corrigibility: Ensuring that the AGI is low-impact, conservative, shutdownable, and otherwise corrigible.

“These two problems appear in the strawberry problem, which Eliezer’s been pointing at for quite some time: the problem of getting an AI to place two identical (down to the cellular but not molecular level) strawberries on a plate, and then do nothing else. The demand of cellular-level copying forces the AI to be capable; the fact that we can get it to duplicate a strawberry instead of doing some other thing demonstrates our ability to direct it; the fact that it does nothing else indicates that it’s corrigible (or really well aligned to a delicate human intuitive notion of inaction).” [from the Sharp left turn post]

My opinion: At the end of the day, I’m not that worried about corrigibility at the moment. So far, we’ve got no evidence of ChatGPT failing to be corrigible and little evidence of it having a wrong pointer. The issues I’m raising are not specific to RLHF, but to priors on techniques which don’t successfully prove they are safe against those (i.e. all techniques presented to this day). Even if Anthropic tries to demonstrate this in the model-written-evals (see fig. 1), it seems to me that that’s not really disproving corrigibility. I think it would be pretty easy to implement corrigibility because “To allow oneself to be turned off”, is not such a complicated task, and humans should be able to evaluate this and implement this with RLHF. If we had a robot with the same cognitive performance as ChatGPT, it would be easy to fine-tune it to be corrigible. So we still don’t have much evidence.

### Unknown properties under generalization

19. Distributional leap: RLHF requires some negative feedback in order to train it. For large-scale tasks, this could maybe require killing someone to continue the gradient descent? We could avoid this with other techniques like Training Language Models with Language Feedback, or maybe we could use a form of curriculum learning, to start the training with benign actions and then finish on actions with potentially large consequences, but it seems that ‘killing someone’ is what would happen for large-scale vanilla-RLHF training runs.

• Unknown properties under generalization: Even at the limit of the amount of data & variety you can provide via RLHF, when the learned policy generalizes perfectly to all new situations you can throw at it, the result will still almost certainly be misaligned because there are still near infinite such policies, and they each behave differently on the infinite remaining types of situation you didn’t manage to train it on yet. Because the particular policy is just one of many, it is unlikely to be correct. [from Maxwell Clarke]

• Large distributional shift to dangerous domains: RLHF does not a priori generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions [LOL10]

• Sim to real is hard: RLHF won’t enable you to perform a pivotal act [LOL11]. Under the current paradigm, you would need the model to execute multiple pivotal acts, then assess each one. You don’t go from “Do you prefer, the backflip on the right or on the left” to “Do you prefer the pivotal act on the right or on the left?”.

• High intelligence is a large shift. Once the model becomes very intelligent and agentic because of RLHF, this is akin to a large shift of distribution. [LOL12]

20. Sharp left turn: How to scale safely if model performances go up? Capability generalizes further than alignment. Some values collapse when self-reflecting. The alignment problem now requires we look into the details of generalization. This is where all the interesting stuff is. According to Nate Soares, this problem is upstream of the problem of “Directing a capable AGI towards an objective of your choosing and ensuring that the AGI is low-impact, conservative, shutdownable, and otherwise corrigible.”. So, even if 17 and 18 seem to be solved, 20 can still arise.

My opinion:

• 19. Distribution shift seems to be a problem during testing and seems to be plausible.

• 20. The sharp left turn seems to be a problem during training. It’s uncertain whether it would occur during pre-training or fine-tuning. I have the intuition that the sharp left turn is also plausible, but this is much more sketchy.

## Final thoughts

To make a long story short, vanilla-RLHF does not solve any of the problems listed in the list of lethalities.

• What would happen if we used RLHF on an unaligned AGI and not ChatGPT?

• Which of the problems listed here would happen first?

• How can we reduce the list of problems to a few sub-problems?

1. ^

I tend to agree with Katja Grace that values are not fragile in the sense imagined in the sequences [link].

2. ^
3. ^

It’s actually not that expensive, I’m willing to buy an aligned AI for a lot more than that. But it gives a lower bound on the order of magnitude of the alignment fee for RLHF.

4. ^

This does not seem to be a problem per se if your model of the human giving feedback is robust. But your model has to be robust. Also keep in mind that even pure human feedback is likely to lead to AI takeover.

5. ^

As a first approximation, I suspect we can consider that only the upper layers of the model have been refined, the lower intermediate layers having not been modified. In Sparrow, only the upper 16 layers have been fine-tuned.

Crossposted to EA Forum (18 points, 0 comments)
• I do AI policy, and it has been extremely rare for me to see something that sticks out to me as something as valuable as this. I don’t think you’re aware how incredibly valuable your gifs are for describing problems with RLHF. The human brain is structured to glaze over, ignore, or forget text, while considering visual evidence with the full force of reasoning faculties. Visual information is vastly more intuitive and digestible than written language, it is like “getting your foot in the door” to someone’s limited daily allowance of attention and higher reasoning faculties. Although it requires some buy in to figure out ways to correctly and honestly describe the problem with gifs, the communication value (to other ML researchers) is massive and the ratio of honest, successful communication to effort is extremely high. Using gifs to visually describe the problems with RLHF just scales really well with more time, effort, and cognition spent on really accurate gifs, after making the first gifs.

Most people don’t notice or seriously consider the problem with RLHF because they don’t feel like digesting text criticizing RLHF, and more gifs will give experts a fair chance to have accurate thoughts about the problems with RLHF. I’m not an expert and I don’t know how difficult it is to actually make gifs that can describe the complex problems, but I do know that if the bare minimum is managed at making such gifs, the paper has a much higher chance of causing a paradigm shift among ML researchers than the average ML researcher might think. Even if describing most of the RLHF problems with gifs is impossible or systematically fails, it will still have its effect amplified by allowing ML researchers to begin communicating problems to tech executives and policymakers who manage their projects and funding.

I hope that this isn’t your final compendium on RLHF (I’m bookmarking it either way) and that you spend at least a little bit of time evaluating whether it’s possible to describe problems with RLHF to ML researchers using gifs. This is a gold mine, and it never occurred to me that it could be done until I saw that gif. If you can’t figure out a way to do it yourself, I recommend asking around for information about funding such as the EA future fund and ask about funds and grantmaking at rationalist events and people will probably be able to connect you to large sources of funding, teams of artists and animators to handle the grunt work, or even other ML researchers who have a lot of knowledge of ways to accurately depict RLHF problems visually (think Rob Bensinger or Eliezer Yudkowsky).

• Thank you! Yes, for most of these issues, it’s possible to create GIFs or at least pictograms. I can see the value this could bring to decision-makers.

However, even though I am quite honored, it’s not because I wrote this post that I am the best person to do this kind of work. So, if anyone is inspired to work on this, feel free to send me a private message.

• Disclaimer: This comment was written as part of my application process to become an intern supervised by the author of this post.

## Potential uses of the post:

This post is an excellent summary, and I think it has great potential for several purposes, in particular being used as part of a sequence on RLHF. It is a good introduction for many reasons:

1. It’s very useful to have lists like those, easily accessible to serve as reminders or pointers when you discuss with other people.

2. For aspiring RLHF understanders, it can provide minimum information to quickly prioritize what to learn about.

3. It can be used to generate ideas of research (“which of these problems could I solve?”) or superficially check that an idea is not promising (“it looks fancy, but actually it does not help against this problem”).

4. It can be used as a gateway to more in-depth articles. To that end, I would really appreciate it if you put links for each point, or mention that you are not aware of any specific article on the subject.

## Meta-level critiques:

If it is taken as an introduction to RLHF risks, you should make clear whether this list is exhaustive (to the best of your knowledge). This will allow readers who are aware it isn’t to easily propose additions.
To facilitate its improvement, you should make explicit calls to the reader to point out where you suspect the post might fail; in particular, there could be a class of readers who are experts in a specific problem with RLHF not listed here, who come only to get a glimpse of related failure modes. They should be encouraged to participate.

As Daniel Kokotajlo and trevor have pointed out, the main value of this post is to provide an easy way to learn more about the problems with RLHF (as opposed to e.g. LOL which tries to be an insightful, comprehensive compilation on its own), thanks to the format and the organization.

The epistemic status of each point is unclear, which I think is a big issue. You give your thoughts after each section, but there is a big lack of systematic evaluation. You should separate for each point:

• its severity,

• its likelihood,

• whether we have empirical, theoretical evidence or abstract reasons it should happen.

This has not been done in a systematic fashion, and it could be organized more clearly.

## More specific criticism:

1. I am unsatisfied with how 7) is described. It is not a problem on the same level as others, more the destruction of a quality that fortunately seems to happen by default on GPTs. It could use a more in-depth explanation, especially since the linked article is mostly speculation.

2. I also think 11) belongs to this category of ‘not quite a problem’, because it is not obvious that direct human feedback would be better than learning a model of it.
Maybe an easy way to predict humans noticing misalignment is to have a fully general model of what it means to be misaligned? Unlikely, but it deserves a longer discussion.

3. 9) is another point that requires a longer discussion. Since it seems to be your own work, maybe you could write an article and link to it?
What are the costs of RLHF (money and manpower) and how do they compare to scaling laws? Maybe it’s an issue… but maybe not. Data is needed here.

4. Talking about the Strawberry Problem is a bit unfair, because RLHF was never meant to solve it, so not only is it not surprising RLHF provides little insight into the Strawberry Problem, I also don’t expect that a solution to the Strawberry Problem would relate at all with RLHF. It seems like a different paradigm altogether.

5. More generally, RLHF is exactly the kind of method warned against by a security mindset. It is an ad hoc method that afaik provides no theoretical guarantee of working at all. The issues with superficial alignment and the inability to generalize alignment in case of a distributional shift are related to that.

6. Why would we have any reason a priori to expect good behavior from RLHF? In the first section, you give empirical reasons to count RLHF as progress but a discussion of the reasons RLHF was even considered in the first place is noticeably lacking.
To be honest, I am very surprised there is no mention of that. Did OpenAI not disclose how they invented RLHF? Did they randomly imagine the process and it happened to work?

In conclusion, I believe that there is a strong need for this kind of post, but that it could be polished more for the potential purposes proposed above.

• It’s gonna take me a while to digest this post, but in the meantime, thank you! This is the sort of content I love to see. (ETA: I strong-upvoted this post)

• I find it useful to work on a spreadsheet to think about the severity of these problems. Here is my template.

• If we had a robot with the same cognitive performance as ChatGPT, it would be easy to fine-tune it to be corrigible.

This is false, and the reason may be a bit subtle. Basically “agency” is not a bare property of programs, it’s a property of how programs interact with their environment. ChatGPT is corrigible relative to the environment of the real world, in which it just sits around outputting text. This is easy because it’s not really an agent relative to the real world! However, ChatGPT is an agent relative to the text environment—it’s trying to steer the text in a preferred direction [1].

A robot that literally had the same cognitive performance as ChatGPT would just move the robot body in a way that encoded text, not in a way that had any skill at navigating the real world. But a robot that had cognitive capabilities analogous to ChatGPT’s, except suited for navigating the real world, would be able to navigate the real world quite well, and would also have corrigibility problems that were never present in ChatGPT because ChatGPT was never trying to navigate the real world.

1. ^

AFAIK this is only precisely true for the KL-penalty regularized version of RLHF, where you can think of the finetuned model as trying to strategically spend its limited ability to update the base transition function, in order to steer the trajectory to higher reward. For early stopping regularized RLHF you probably get something mathematically messier.

• Thanks, I overlooked this and it makes sense to me. However, I’m not as certain about your last sentence:

and would also have corrigibility problems that were never present in ChatGPT because ChatGPT was never trying to navigate the real world.

I agree with the idea of “steering the trajectory,” and this is a possibility we must consider. However, if we train the robot to use the “Shut Down” token when it hears “Hi RobotGPT, please shut down,” I don’t see why it wouldn’t work.

It seems to me that we’re comparing a second-order effect with a first-order effect.

• The model has been shaped to maximize its reward by any means necessary[2], even if it means suddenly delivering an invitation to a wedding party. This is weak evidence towards the “playing the training game” scenario.

This conclusion seems unwarranted. What we have observed is (Paul claiming the existence of) an optimized model which ~always brings up weddings. On what basis does one infer that “the model has been shaped to maximize its reward by any means necessary”? This is likewise not weak evidence for playing the training game.

• The “Davidad suggests” link is broken, and maybe should go to this comment?

• Fixed, thanks!

• This was really helpful, thanks for the post!

• Using RL(AI)F may offer a solution to all the points in this section: By starting with a set of established principles, AI can generate and revise a large number of prompts, selecting the best answers through a chain-of-thought process that adheres to these principles. Then, a reward model can be trained and the process can continue as in RLHF. This approach is potentially better than RLHF as it does not require human feedback.

I’d like to say that I fervently disagree with this. Giving an unaligned AI the opportunity to modify its own weights (by categorizing its own responses to questions), then politely asking it to align itself, is quite possibly the worst alignment plan I’ve ever heard; it’s penny-wise, pound-foolish. (Assuming it even is penny-wise; I can think of several ways to generate a self-consistent AI that would cost less.)

• I found this quite helpful, even if some points could use a more thorough explanation.

• the public was not happy with the fact that the AI kept repeating “I am an AI developed by OpenAI”, which pushed OpenAI to release the January 9 version that is again much more hackable than the December 15 patch version (benchmark coming soon).

Wow, that sounds bad. Do you have any source for this?