RLHF

Last edit: 13 Nov 2022 2:18 UTC by Multicore

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which a model’s training signal comes from human evaluations of its outputs rather than from labeled data or a ground-truth reward signal. In practice this usually means training a reward model on human preference comparisons between outputs, then fine-tuning the model with reinforcement learning against that learned reward.
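As a concrete illustration of the first half of that pipeline, here is a minimal sketch of reward-model training on synthetic pairwise preferences using the standard Bradley-Terry objective. It is not taken from any of the posts listed below; the linear reward model, the toy feature vectors, and the simulated “rater” are all illustrative assumptions, and the subsequent RL fine-tuning step (e.g. PPO against the learned reward) is omitted.

```python
# Minimal, illustrative sketch of the reward-modelling step in RLHF.
# A hidden "true preference" direction stands in for human raters; a linear
# reward model is fit to their pairwise comparisons with the Bradley-Terry loss.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
true_pref = rng.normal(size=dim)  # stands in for the human raters' judgment

def sample_comparison():
    """Return (preferred, rejected) feature vectors, as a human labeller would rank them."""
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    return (a, b) if a @ true_pref > b @ true_pref else (b, a)

comparisons = [sample_comparison() for _ in range(500)]

# Linear reward model r(x) = w @ x, trained to maximize log sigmoid(r(preferred) - r(rejected)).
w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    grad = np.zeros(dim)
    for preferred, rejected in comparisons:
        margin = w @ (preferred - rejected)
        sigmoid = 1.0 / (1.0 + np.exp(-margin))
        # Gradient of -log sigmoid(margin) with respect to w.
        grad += -(1.0 - sigmoid) * (preferred - rejected)
    w -= lr * grad / len(comparisons)

# The learned reward model should now agree with the simulated raters on held-out pairs.
held_out = [sample_comparison() for _ in range(200)]
accuracy = np.mean([float(w @ p > w @ r) for p, r in held_out])
print(f"reward model matches 'human' preferences on {accuracy:.0%} of held-out pairs")
```

In a full RLHF setup the reward model would be a neural network scoring (prompt, response) pairs, and its output would stand in for the missing ground-truth reward when the base model is fine-tuned with an RL algorithm such as PPO.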

[Link] Why I’m excited about AI-assisted human feedback

janleike, 6 Apr 2022 15:37 UTC
29 points
0 comments, 1 min read, LW link

Thoughts on the impact of RLHF research

paulfchristiano, 25 Jan 2023 17:23 UTC
227 points
101 comments, 9 min read, LW link

The Waluigi Effect (mega-post)

Cleo Nardo, 3 Mar 2023 3:22 UTC
568 points
164 comments, 16 min read, LW link

Trying to disambiguate different questions about whether RLHF is “good”

Buck, 14 Dec 2022 4:03 UTC
95 points
45 comments, 7 min read, LW link

[Question] Don’t you think RLHF solves outer alignment?

Raphaël S, 4 Nov 2022 0:36 UTC
4 points
23 comments, 1 min read, LW link

A philosopher’s critique of RLHF

ThomasW, 7 Nov 2022 2:42 UTC
55 points
8 comments, 2 min read, LW link

Mysteries of mode collapse

janus, 8 Nov 2022 10:37 UTC
254 points
50 comments, 14 min read, LW link

Update to Mysteries of mode collapse: text-davinci-002 not RLHF

janus, 19 Nov 2022 23:51 UTC
70 points
8 comments, 2 min read, LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

5 Dec 2022 20:28 UTC
38 points
17 comments, 10 min read, LW link

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner, 12 Dec 2022 11:51 UTC
33 points
14 comments, 2 min read, LW link

Take 10: Fine-tuning with RLHF is aesthetically unsatisfying.

Charlie Steiner, 13 Dec 2022 7:04 UTC
36 points
3 comments, 2 min read, LW link

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC, 16 Dec 2022 22:12 UTC
65 points
11 comments, 1 min read, LW link
(www.anthropic.com)

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike, 5 Dec 2022 22:51 UTC
96 points
13 comments, 1 min read, LW link
(aligned.substack.com)

Take 13: RLHF bad, conditioning good.

Charlie Steiner, 22 Dec 2022 10:44 UTC
53 points
4 comments, 2 min read, LW link

Model-driven feedback could amplify alignment failures

aogara, 30 Jan 2023 0:00 UTC
17 points
1 comment, 2 min read, LW link

Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic)

LawrenceC, 16 Feb 2023 19:47 UTC
65 points
9 comments, 1 min read, LW link
(arxiv.org)

A library for safety research in conditioning on RLHF tasks

James Chua, 26 Feb 2023 14:50 UTC
10 points
2 comments, 1 min read, LW link

RLHF does not appear to differentially cause mode-collapse

20 Mar 2023 15:39 UTC
88 points
8 comments, 3 min read, LW link

Recommend HAIST resources for assessing the value of RLHF-related alignment research

5 Nov 2022 20:58 UTC
26 points
9 comments, 3 min read, LW link

Learning from Human Preferences—from OpenAI (including Christiano, Amodei & Legg)

Dr_Manhattan, 13 Jun 2017 15:52 UTC
17 points
12 comments, 1 min read, LW link
(blog.openai.com)

A first success story for Outer Alignment: InstructGPT

Noosphere89, 8 Nov 2022 22:52 UTC
6 points
1 comment, 1 min read, LW link
(openai.com)

[ASoT] Finetuning, RL, and GPT’s world prior

Jozdien, 2 Dec 2022 16:33 UTC
40 points
8 comments, 5 min read, LW link

[Question] Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann Dubois, 19 Dec 2022 22:42 UTC
5 points
6 comments, 1 min read, LW link

RLHF

Ansh Radhakrishnan, 12 May 2022 21:18 UTC
18 points
5 comments, 5 min read, LW link

In-Context Learning: A Bridge between RL and Expected Utility Maximization

Zachary Robertson, 31 Dec 2022 21:39 UTC
7 points
0 comments, 2 min read, LW link

On the Importance of Open Sourcing Reward Models

elandgre, 2 Jan 2023 19:01 UTC
17 points
5 comments, 6 min read, LW link

Optimality is the tiger, and annoying the user is its teeth

Christopher King, 28 Jan 2023 20:20 UTC
24 points
5 comments, 2 min read, LW link

Compendium of problems with RLHF

Raphaël S, 29 Jan 2023 11:40 UTC
94 points
12 comments, 10 min read, LW link

Pretraining Language Models with Human Preferences

21 Feb 2023 17:57 UTC
129 points
16 comments, 11 min read, LW link

Validator models: A simple approach to detecting goodharting

beren, 20 Feb 2023 21:32 UTC
15 points
1 comment, 4 min read, LW link

[Preprint] Pretraining Language Models with Human Preferences

thesofakillers, 21 Feb 2023 11:44 UTC
12 points
0 comments, 1 min read, LW link
(arxiv.org)

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter, 10 Mar 2023 7:54 UTC
11 points
0 comments, 12 min read, LW link

Human preferences as RL critic values—implications for alignment

Seth Herd, 14 Mar 2023 22:10 UTC
10 points
5 comments, 6 min read, LW link