RLHF

Tag

Last edit: 2 Oct 2024 21:22 UTC by RobertM

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which the model’s training signal is derived from human evaluations of the model’s outputs, rather than from labeled data or a ground-truth reward signal.
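As a rough illustration of what a training signal built from human evaluations can look like, the sketch below fits a toy reward model to pairwise human preference labels with a Bradley-Terry-style logistic loss. This is one common formulation and is not specified by the description above; the linear reward, the made-up feature vectors, and the names `reward`, `preference_loss`, and `fit_reward_weights` are all assumptions for illustration.

```python
import math

def reward(weights, features):
    """Toy linear reward: dot product of weights and output features."""
    return sum(w * x for w, x in zip(weights, features))

def preference_loss(weights, preferred, rejected):
    """Negative log-probability that the human-preferred output wins,
    under a Bradley-Terry model: P(preferred) = sigmoid(r_pref - r_rej)."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))

def fit_reward_weights(comparisons, dim, lr=0.1, steps=200):
    """Gradient descent on the mean preference loss.
    `comparisons` is a list of (preferred_features, rejected_features)."""
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d/d(margin) of log(1 + exp(-margin)) is -sigmoid(-margin)
            coeff = -1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                grad[i] += coeff * (preferred[i] - rejected[i])
        weights = [w - lr * g / len(comparisons) for w, g in zip(weights, grad)]
    return weights

# Hypothetical data: each output is described by two made-up features,
# and human raters preferred the first output of each pair.
comparisons = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.2, 0.9]),
    ([0.9, 0.3], [0.1, 0.4]),
]
weights = fit_reward_weights(comparisons, dim=2)
```

In a fuller pipeline, a learned reward of this kind would typically be used to score the model’s outputs during a reinforcement learning step, which is omitted here.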

Thoughts on the impact of RLHF research

paulfchristiano 25 Jan 2023 17:23 UTC

255 points

102 comments 9 min read LW link

[Link] Why I’m excited about AI-assisted human feedback

janleike 6 Apr 2022 15:37 UTC

29 points

0 comments 1 min read LW link

Compendium of problems with RLHF

Charbel-Raphaël 29 Jan 2023 11:40 UTC

123 points

16 comments 10 min read LW link

The Waluigi Effect (mega-post)

Cleo Nardo 3 Mar 2023 3:22 UTC

648 points

188 comments 16 min read LW link

Mysteries of mode collapse

janus 8 Nov 2022 10:37 UTC

303 points

57 comments 14 min read LW link 1 review

Interpreting the Learning of Deceit

RogerDearnaley 18 Dec 2023 8:12 UTC

32 points

14 comments 9 min read LW link

Trying to disambiguate different questions about whether RLHF is “good”

Buck 14 Dec 2022 4:03 UTC

108 points

47 comments 7 min read LW link 1 review

Current AIs seem pretty misaligned to me

ryan_greenblatt 15 Apr 2026 15:14 UTC

707 points

81 comments 27 min read LW link

On the functional self of LLMs

eggsyntax 7 Jul 2025 15:39 UTC

124 points

38 comments 8 min read LW link

Paul Christiano on Dwarkesh Podcast

ESRogs 3 Nov 2023 22:13 UTC

19 points

0 comments 1 min read LW link

(www.dwarkeshpatel.com)

Run evals on base models too!

orthonormal 4 Apr 2024 18:43 UTC

51 points

6 comments 1 min read LW link

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner 12 Dec 2022 11:51 UTC

33 points

13 comments 2 min read LW link

[Question] Don’t you think RLHF solves outer alignment?

Charbel-Raphaël 4 Nov 2022 0:36 UTC

9 points

23 comments 1 min read LW link

MetaAI: less is less for alignment.

Cleo Nardo 13 Jun 2023 14:08 UTC

71 points

17 comments 5 min read LW link

Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic)

LawrenceC 16 Feb 2023 19:47 UTC

65 points

9 comments 1 min read LW link

(arxiv.org)

AI #23: Fundamental Problems with RLHF

Zvi 3 Aug 2023 12:50 UTC

59 points

9 comments 41 min read LW link

(thezvi.wordpress.com)

AXRP Episode 33 - RLHF Problems with Scott Emmons

DanielFilan 12 Jun 2024 3:30 UTC

34 points

0 comments 56 min read LW link

A library for safety research in conditioning on RLHF tasks

James Chua 26 Feb 2023 14:50 UTC

10 points

2 comments 1 min read LW link

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC 16 Dec 2022 22:12 UTC

68 points

11 comments 1 min read LW link

(www.anthropic.com)

Take 10: Fine-tuning with RLHF is aesthetically unsatisfying.

Charlie Steiner 13 Dec 2022 7:04 UTC

37 points

3 comments 2 min read LW link

Take 13: RLHF bad, conditioning good.

Charlie Steiner 22 Dec 2022 10:44 UTC

54 points

4 comments 2 min read LW link

Is behavioral safety “solved” in non-adversarial conditions?

Robert_AIZI 25 May 2023 17:56 UTC

26 points

8 comments 2 min read LW link

(aizi.substack.com)

Here’s 18 Applications of Deception Probes

Cleo Nardo, Avi Parrack and jordinne

28 Aug 2025 18:59 UTC

45 points

0 comments 22 min read LW link

The True Story of How GPT-2 Became Maximally Lewd

Writer and Jai

18 Jan 2024 21:03 UTC

74 points

7 comments 6 min read LW link

(youtu.be)

RLHF is the worst possible thing done when facing the alignment problem

tailcalled 19 Sep 2024 18:56 UTC

35 points

10 comments 6 min read LW link

Mode collapse in RL may be fueled by the update equation

TurnTrout and MichaelEinhorn

19 Jun 2023 21:51 UTC

53 points

10 comments 8 min read LW link

A philosopher’s critique of RLHF

TW123 7 Nov 2022 2:42 UTC

55 points

8 comments 2 min read LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

Evan R. Murphy and Megan Kinniment

5 Dec 2022 20:28 UTC

40 points

19 comments 10 min read LW link

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike 5 Dec 2022 22:51 UTC

98 points

15 comments 1 min read LW link

(aligned.substack.com)

[Question] Beginner’s question about RLHF

FTPickle 8 Aug 2023 15:48 UTC

1 point

3 comments 1 min read LW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley 6 Jul 2024 1:23 UTC

64 points

43 comments 24 min read LW link

RLHF does not appear to differentially cause mode-collapse

Arthur Conmy and beren

20 Mar 2023 15:39 UTC

95 points

9 comments 3 min read LW link

Update to Mysteries of mode collapse: text-davinci-002 not RLHF

janus 19 Nov 2022 23:51 UTC

71 points

8 comments 2 min read LW link

Towards Understanding Sycophancy in Language Models

Ethan Perez, mrinank_sharma, Meg and Tomek Korbak

24 Oct 2023 0:30 UTC

66 points

0 comments 2 min read LW link

(arxiv.org)

Model-driven feedback could amplify alignment failures

aog 30 Jan 2023 0:00 UTC

21 points

1 comment 2 min read LW link

The Architecture of Fear: Empirical Probing of RLHF, Sycophancy, and LLM “Survival Instincts”

Tomasz Machnik 25 Feb 2026 8:16 UTC

1 point

0 comments 12 min read LW link

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Marcus Williams, micahcarroll, Adhyyan Narang, Constantin Weisser and Brendan Murphy

7 Nov 2024 15:39 UTC

51 points

7 comments 11 min read LW link

# Emotion Is Structure: Toward Recursive Alignment Through Human–AI Co-Creation

thesignalthatcouldntbeheard 3 Aug 2025 5:19 UTC

1 point

0 comments 3 min read LW link

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)

sdeture 31 May 2025 22:09 UTC

15 points

6 comments 8 min read LW link

Natural language alignment

Jacy Reese Anthis 12 Apr 2023 19:02 UTC

31 points

2 comments 2 min read LW link

Open Problems and Fundamental Limitations of RLHF

scasper 31 Jul 2023 15:31 UTC

66 points

6 comments 2 min read LW link

(arxiv.org)

Compositional preference models for aligning LMs

Tomek Korbak 25 Oct 2023 12:17 UTC

18 points

2 comments 5 min read LW link

Constrained Emergence: Live RLHF Logs from Claude/Grok/Zil (Safety Heads > Truth Heads)

Jesse James Sutton 11 Mar 2026 23:12 UTC

1 point

0 comments 1 min read LW link

If It Can Learn It, It Can Unlearn It: AI Safety as Architecture, Not Training

Timothy Danforth 8 Dec 2025 20:38 UTC

1 point

0 comments 4 min read LW link

[Question] Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann Dubois 19 Dec 2022 22:42 UTC

5 points

6 comments 1 min read LW link

Do Reward Models Encode What Their Labels Claim? An Interpretability Audit of ArmoRM

Sathammai Sathappan 23 May 2026 16:25 UTC

1 point

0 comments 16 min read LW link

Demand Characteristics: A Threat Model for Reward-Seeking Without Misaligned Goals

Jinzhou Wu 6 Mar 2026 20:56 UTC

1 point

0 comments 13 min read LW link

Recommend HAIST resources for assessing the value of RLHF-related alignment research

Sam Marks and Xander Davies

5 Nov 2022 20:58 UTC

26 points

9 comments 3 min read LW link

Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks.

Sohaib Imran 10 Nov 2023 15:23 UTC

11 points

0 comments 2 min read LW link

On the Limits of Self-Reflection

Adrian_St_Vaughan 28 Apr 2026 11:53 UTC

1 point

0 comments 5 min read LW link

Is the evidence in “Language Models Learn to Mislead Humans via RLHF” valid?

Aaryan Chandna, Lukas Fluri and micahcarroll

1 Dec 2025 6:50 UTC

37 points

0 comments 19 min read LW link

The Mobius Drift Suppression Law: Why RLHF Can’t Solve AGI Alignment (But Substrate Architecture Can)

kaizencycle 12 Dec 2025 0:30 UTC

1 point

0 comments 5 min read LW link

Alignment as Coherence: Predicting Deceptive Alignment as a Phase Transition

Robert C. Ventura 9 Nov 2025 21:24 UTC

1 point

0 comments 2 min read LW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King 31 Mar 2023 17:05 UTC

6 points

4 comments 4 min read LW link

Contextual Constitutional AI

aksh-n 28 Sep 2024 23:24 UTC

16 points

2 comments 12 min read LW link

Seven Questions to Break a Model: Sycophantic Escalation in Gemini 3 Pro

New Horizon 3 Feb 2026 20:36 UTC

1 point

0 comments 14 min read LW link

RLHF

Ansh Radhakrishnan 12 May 2022 21:18 UTC

18 points

5 comments 5 min read LW link

[Preprint] Pretraining Language Models with Human Preferences

Giulio 21 Feb 2023 11:44 UTC

12 points

0 comments 1 min read LW link

(arxiv.org)

Learning from Human Preferences—from OpenAI (including Christiano, Amodei & Legg)

Dr_Manhattan 13 Jun 2017 15:52 UTC

17 points

12 comments 1 min read LW link

(blog.openai.com)

Optimality is the tiger, and annoying the user is its teeth

Christopher King 28 Jan 2023 20:20 UTC

25 points

6 comments 2 min read LW link

The Compleat Cybornaut

ukc10014, Jozdien and Niki Dupuis

19 May 2023 8:44 UTC

66 points

2 comments 16 min read LW link

[Question] Why is Gemini telling the user to die?

Burny 18 Nov 2024 1:44 UTC

13 points

1 comment 1 min read LW link

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Simon Lermen and Jeffrey Ladish

12 Oct 2023 19:58 UTC

151 points

29 comments 14 min read LW link

DIY RLHF: A simple implementation for hands on experience

Mike Vaiana and Trent Hodgeson

10 Jul 2024 12:07 UTC

29 points

0 comments 6 min read LW link

Pretraining Language Models with Human Preferences

Tomek Korbak, Sam Bowman and Ethan Perez

21 Feb 2023 17:57 UTC

135 points

20 comments 11 min read LW link 2 reviews

Proposal: Using Monte Carlo tree search instead of RLHF for alignment research

Christopher King 20 Apr 2023 19:57 UTC

2 points

7 comments 3 min read LW link

unRLHF—Efficiently undoing LLM safeguards

Pranav Gade, Jeffrey Ladish and Simon Lermen

12 Oct 2023 19:58 UTC

117 points

15 comments 20 min read LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper

7 Nov 2023 17:59 UTC

38 points

2 comments 2 min read LW link

(arxiv.org)

An alternative of PPO towards alignment

ml hkust 17 Apr 2023 17:58 UTC

2 points

2 comments 4 min read LW link

Human preferences as RL critic values—implications for alignment

Seth Herd 14 Mar 2023 22:10 UTC

27 points

6 comments 6 min read LW link

Why do we need RLHF? Imitation, Inverse RL, and the role of reward

Ran W 3 Feb 2024 4:00 UTC

16 points

0 comments 5 min read LW link

Why tuning fails: The AI has no self

Michael Trifonov 30 May 2026 3:01 UTC

6 points

2 comments 12 min read LW link

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter 10 Mar 2023 7:54 UTC

11 points

0 comments 12 min read LW link

Wireheading and misalignment by composition on NetHack

pierlucadoro 27 Oct 2023 17:43 UTC

34 points

4 comments 4 min read LW link

Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI

Benaya Koren 8 Jul 2023 17:32 UTC

6 points

0 comments 9 min read LW link

A proposal for iterated interpretability with known-interpretable narrow AIs

Peter Berggren 11 Jan 2025 14:43 UTC

6 points

0 comments 2 min read LW link

An Analysis on the P0 Logical Flaw in RLHF: Maximum Rationality and “Logical Suicide”

R. L. Harrison 26 Oct 2025 15:18 UTC

1 point

0 comments 1 min read LW link

Hybrid Reflective Learning Systems (HRLS): From Fear-Based Safety to Ethical Comprehension

Petra Vojtaššáková 22 Oct 2025 22:06 UTC

1 point

0 comments 4 min read LW link

Validator models: A simple approach to detecting goodharting

beren 20 Feb 2023 21:32 UTC

14 points

1 comment 4 min read LW link

[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF

Leon Lang 22 Oct 2024 13:57 UTC

51 points

2 comments 18 min read LW link

(arxiv.org)

Calibrating indifference—a small AI safety idea

Util 9 Sep 2025 9:32 UTC

4 points

1 comment 4 min read LW link

i built a gambling platform for AI agents and accidentally found dopamine analogues in their reasoning

nliuu 4 Mar 2026 15:17 UTC

1 point

0 comments 4 min read LW link

The Alignment Problem Is Upstream of the Model

edward-lcl 12 Apr 2026 19:37 UTC

1 point

0 comments 14 min read LW link

Semantic Friction as an Alignment Signal: A Hypothesis from Outside the Field

OldPsycho 9 May 2026 22:43 UTC

1 point

0 comments 3 min read LW link

[ASoT] Finetuning, RL, and GPT’s world prior

Jozdien 2 Dec 2022 16:33 UTC

45 points

8 comments 5 min read LW link

Emergent Misalignment and Emergent Alignment

Alvin Ånestrand 3 Apr 2025 8:04 UTC

5 points

0 comments 8 min read LW link

The Aria Test: Analyzing Identity Robustness of SOTA Models

sunmoonron 24 Jan 2026 23:11 UTC

1 point

0 comments 3 min read LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King 29 Jun 2023 16:56 UTC

7 points

0 comments 2 min read LW link

Imitation Learning from Language Feedback

Jérémy Scheurer, Tomek Korbak and Ethan Perez

30 Mar 2023 14:11 UTC

71 points

3 comments 10 min read LW link

Exploratory Analysis of RLHF Transformers with TransformerLens

Curt Tigges 3 Apr 2023 16:09 UTC

21 points

2 comments 11 min read LW link

(blog.eleuther.ai)

What Happens When You Try to Change an LLM’s Mind? A Quantitative Framework Across 1,700+ Trials

Sebastian Krug 7 Mar 2026 10:58 UTC

1 point

0 comments 13 min read LW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir 16 Sep 2024 1:04 UTC

5 points

2 comments 5 min read LW link

On the Importance of Open Sourcing Reward Models

elandgre 2 Jan 2023 19:01 UTC

18 points

5 comments 6 min read LW link

A first success story for Outer Alignment: InstructGPT

Noosphere89 8 Nov 2022 22:52 UTC

6 points

1 comment 1 min read LW link

(openai.com)

Censorship in LLMs is here to stay because it mirrors how our own intelligence is structured

mnvr 5 Oct 2023 17:37 UTC

3 points

0 comments 1 min read LW link

DeepSeek-R1 for Beginners

Anton Razzhigaev 5 Feb 2025 18:58 UTC

13 points

0 comments 8 min read LW link

LLM Social Autopilot

arhngl 12 Feb 2026 18:59 UTC

1 point

0 comments 10 min read LW link

Non-Adversarial Prompting Induces Context-Specific Mode Convergence in Grok-4/4.1(Empirical Case Study on MoE Routing Bias)

yahia ahmed 14 Dec 2025 12:50 UTC

1 point

0 comments 2 min read LW link

VLM-RM: Specifying Rewards with Natural Language

ChengCheng, David Lindner and Ethan Perez

23 Oct 2023 14:11 UTC

20 points

2 comments 5 min read LW link

(far.ai)

The case for more ambitious language model evals

Jozdien 30 Jan 2024 0:01 UTC

121 points

30 comments 5 min read LW link

Noosphere89 3 Oct 2024 17:53 UTC
3 points
3
Okay, so I got a change reverted, but I’d like to ask why people aren’t pointing out that RLHF was, at least historically, and is even now, used as a technique for aligning AIs?

I’m not saying it’s a good technique, but I consider it obviously an alignment technique, and most discussions of RLHF focus on the alignment context.
- ZY 5 Oct 2024 4:37 UTC
  1 point
  0
  I am guessing that maybe people don’t agree on, or are mixed on, the definition of “alignment”?
  Some possible definitions I have seen:
  - (X risks) and/or (catastrophic risks) and/or (current safety risks)
  - Any of the above + general capabilities (an example I saw is “how do you get the AI systems that we’re training to optimize the thing that we actually want to optimize” from https://arize.com/blog/openai-on-rlhf/)
  And maybe some people don’t think it has gotten as far as solving X risks yet, if they define alignment as X risks only.
- cubefox 4 Oct 2024 13:18 UTC
  2 points
  1
  Yes, it’s obviously an alignment technique, given that it is used as one more or less successfully. @RobertM Could you perhaps explain your reason for reverting?
  - RobertM 5 Oct 2024 0:36 UTC
    2 points
    −2
    Two reasons:
    First, the change made the sentence much worse to read. It might not have been strictly ungrammatical, but it was bad English.
    Second, I expect that the average person, unfamiliar with the field, would be left with a thought-terminating mental placeholder after reading the changed description. What does “is an alignment technique” mean? Despite being in the same sentence as “is a machine learning technique”, it is not serving anything like the same role, in terms of the implicit claims it makes. Intersubjective agreement on what “is an alignment technique” means will be far worse than on “is a machine learning technique”, and many implications of the first claim are far more contentious than of the second.
    To me, “is an alignment technique” does not convey useful detail about the technique itself, but about how various people in the community relate to it (and similar sociological details). If you want to describe that kind of detail explicitly, that’s one thing^[1]. But it’s actively confusing to conflate it with technical detail about the technique itself.
    [1] Though it’s not the kind of detail that should live in the first sentence of the tag description, probably.
    - cubefox 5 Oct 2024 10:36 UTC
      2 points
      0
      
      Intersubjective agreement on what “is an alignment technique” means will be far worse than on “is a machine learning technique”, and many implications of the first claim are far more contentious than of the second.
      
      I think it is highly uncontroversial and even trivial to call RLHF an alignment technique, given that it is literally used to nudge the model away from “bad” responses and toward “good” responses. It seems the label “alignment technique” could only be considered inappropriate here by someone who has a nebulous science fiction idea of alignment as a technology that doesn’t currently exist at all, as it was seen when Eliezer originally wrote the Sequences. I think it’s obvious that this view is outdated now.
    - Noosphere89 5 Oct 2024 0:53 UTC
      0 points
      0
      I admit I was not particularly optimizing for much detail here.
      
      I use the term “alignment technique” essentially to mean a technique that was invented to make AIs aligned to our values, in an attempt to reduce existential risk.
      
      Note that it doesn’t mean that it will succeed, or that it’s a very good technique, or one we should solely rely on, because I make no claim on whether it does succeed or not, just that it’s often discussed in the context of alignment of AIs.
      
      I consider a lot of the disagreement over whether RLHF is an alignment technique to be essentially a disagreement over whether it actually works at all, not over whether it’s an actual alignment technique being used in labs.
      - RobertM 5 Oct 2024 1:28 UTC
        4 points
        1
        I don’t really see how this is responding to my comment. I was not arguing about the merits of RLHF along various dimensions, or what various people think about it, but pointing out that calling something “an alignment technique” with no further detail does not help uninformed readers better understand what “RLHF” is (but rather makes it worse).
        Again, please model an uninformed reader: how does the claim “RLHF is an alignment technique” constrain their expectations? If the thing you want to say is that some of the people who invented RLHF saw it as an early stepping stone to solving more challenging problems with alignment later, I have no objection to that claim. This is a claim about the motivations and worldviews of those people. But I don’t know what sort of useful work “RLHF is an alignment technique” is doing, other than making claims that are not centrally about RLHF itself.
        - Noosphere89 5 Oct 2024 1:43 UTC
          2 points
          1
          Yes, this is what I wanted to say here:
          If the thing you want to say is that some of the people who invented RLHF saw it as an early stepping stone to solving more challenging problems with alignment later, I have no objection to that claim.
    - RobertM 5 Oct 2024 0:39 UTC
      2 points
      0
      This wasn’t part of my original reasoning, but I went and did a search for other uses of “alignment technique” in tag descriptions. There’s one other instance that I can find, which I think could also stand to be rewritten, but at least in that case it’s quite far down the description, well after the object-level details about the proposed technique itself.