RLHF

Last edit: 13 Nov 2022 2:18 UTC by Multicore

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which a model’s training signal comes from human evaluations of its outputs rather than from labeled data or a ground-truth reward signal. In practice this usually means training a reward model on human preference comparisons between outputs, then fine-tuning the model with reinforcement learning against that learned reward.
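As a concrete illustration of the first half of that pipeline, here is a minimal sketch of reward-model training on synthetic pairwise preferences using the standard Bradley-Terry objective. It is not taken from any of the posts listed below; the linear reward model, the toy feature vectors, and the simulated “rater” are all illustrative assumptions, and the subsequent RL fine-tuning step (e.g. PPO against the learned reward) is omitted.

```python
# Minimal, illustrative sketch of the reward-modelling step in RLHF.
# A hidden "true preference" direction stands in for human raters; a linear
# reward model is fit to their pairwise comparisons with the Bradley-Terry loss.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
true_pref = rng.normal(size=dim)  # stands in for the human raters' judgment

def sample_comparison():
    """Return (preferred, rejected) feature vectors, as a human labeller would rank them."""
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    return (a, b) if a @ true_pref > b @ true_pref else (b, a)

comparisons = [sample_comparison() for _ in range(500)]

# Linear reward model r(x) = w @ x, trained to maximize log sigmoid(r(preferred) - r(rejected)).
w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    grad = np.zeros(dim)
    for preferred, rejected in comparisons:
        margin = w @ (preferred - rejected)
        sigmoid = 1.0 / (1.0 + np.exp(-margin))
        # Gradient of -log sigmoid(margin) with respect to w.
        grad += -(1.0 - sigmoid) * (preferred - rejected)
    w -= lr * grad / len(comparisons)

# The learned reward model should now agree with the simulated raters on held-out pairs.
held_out = [sample_comparison() for _ in range(200)]
accuracy = np.mean([float(w @ p > w @ r) for p, r in held_out])
print(f"reward model matches 'human' preferences on {accuracy:.0%} of held-out pairs")
```

In a full RLHF setup the reward model would be a neural network scoring (prompt, response) pairs, and its output would stand in for the missing ground-truth reward when the base model is fine-tuned with an RL algorithm such as PPO.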

[Link] Why I’m excited about AI-assisted human feedback

janleike, 6 Apr 2022 15:37 UTC
29 points
0 comments, 1 min read, LW link

Thoughts on the impact of RLHF research

paulfchristiano, 25 Jan 2023 17:23 UTC
227 points
101 comments, 9 min read, LW link

The Waluigi Effect (mega-post)

Cleo Nardo, 3 Mar 2023 3:22 UTC
568 points
164 comments, 16 min read, LW link

Trying to disambiguate different questions about whether RLHF is “good”

Buck, 14 Dec 2022 4:03 UTC
95 points
45 comments, 7 min read, LW link

[Question] Don’t you think RLHF solves outer alignment?

Raphaël S, 4 Nov 2022 0:36 UTC
4 points
23 comments, 1 min read, LW link

A philosopher’s critique of RLHF

ThomasW, 7 Nov 2022 2:42 UTC
55 points
8 comments, 2 min read, LW link

Mysteries of mode collapse

janus, 8 Nov 2022 10:37 UTC
254 points
50 comments, 14 min read, LW link

Update to Mysteries of mode collapse: text-davinci-002 not RLHF

janus, 19 Nov 2022 23:51 UTC
70 points
8 comments, 2 min read, LW link

Steering Behaviour: Testing for (Non-)Myopia in Language Models

5 Dec 2022 20:28 UTC
38 points
17 comments, 10 min read, LW link

Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.

Charlie Steiner, 12 Dec 2022 11:51 UTC
33 points
14 comments, 2 min read, LW link

Take 10: Fine-tuning with RLHF is aesthetically unsatisfying.

Charlie Steiner, 13 Dec 2022 7:04 UTC
36 points
3 comments, 2 min read, LW link

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC, 16 Dec 2022 22:12 UTC
65 points
11 comments, 1 min read, LW link
(www.anthropic.com)

[Link] Why I’m optimistic about OpenAI’s alignment approach

janleike, 5 Dec 2022 22:51 UTC
96 points
13 comments, 1 min read, LW link
(aligned.substack.com)

Take 13: RLHF bad, conditioning good.

Charlie Steiner, 22 Dec 2022 10:44 UTC
53 points
4 comments, 2 min read, LW link

Model-driven feedback could amplify alignment failures

aogara, 30 Jan 2023 0:00 UTC
17 points
1 comment, 2 min read, LW link

Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic)

LawrenceC, 16 Feb 2023 19:47 UTC
65 points
9 comments, 1 min read, LW link
(arxiv.org)

A library for safety research in conditioning on RLHF tasks

James Chua, 26 Feb 2023 14:50 UTC
10 points
2 comments, 1 min read, LW link

RLHF does not appear to differentially cause mode-collapse

20 Mar 2023 15:39 UTC
88 points
8 comments, 3 min read, LW link

Recommend HAIST resources for assessing the value of RLHF-related alignment research

5 Nov 2022 20:58 UTC
26 points
9 comments, 3 min read, LW link

Learning from Human Preferences—from OpenAI (including Christiano, Amodei & Legg)

Dr_Manhattan, 13 Jun 2017 15:52 UTC
17 points
12 comments, 1 min read, LW link
(blog.openai.com)

A first success story for Outer Alignment: InstructGPT

Noosphere89, 8 Nov 2022 22:52 UTC
6 points
1 comment, 1 min read, LW link
(openai.com)

[ASoT] Finetuning, RL, and GPT’s world prior

Jozdien, 2 Dec 2022 16:33 UTC
40 points
8 comments, 5 min read, LW link

[Question] Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann Dubois, 19 Dec 2022 22:42 UTC
5 points
6 comments, 1 min read, LW link

RLHF

Ansh Radhakrishnan, 12 May 2022 21:18 UTC
18 points
5 comments, 5 min read, LW link

In-Context Learning: A Bridge between RL and Expected Utility Maximization

Zachary Robertson, 31 Dec 2022 21:39 UTC
7 points
0 comments, 2 min read, LW link

On the Importance of Open Sourcing Reward Models

elandgre, 2 Jan 2023 19:01 UTC
17 points
5 comments, 6 min read, LW link

Optimality is the tiger, and annoying the user is its teeth

Christopher King, 28 Jan 2023 20:20 UTC
24 points
5 comments, 2 min read, LW link

Compendium of problems with RLHF

Raphaël S, 29 Jan 2023 11:40 UTC
94 points
12 comments, 10 min read, LW link

Pretraining Language Models with Human Preferences

21 Feb 2023 17:57 UTC
129 points
16 comments, 11 min read, LW link

Validator models: A simple approach to detecting goodharting

beren, 20 Feb 2023 21:32 UTC
15 points
1 comment, 4 min read, LW link

[Preprint] Pretraining Language Models with Human Preferences

thesofakillers, 21 Feb 2023 11:44 UTC
12 points
0 comments, 1 min read, LW link
(arxiv.org)

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter, 10 Mar 2023 7:54 UTC
11 points
0 comments, 12 min read, LW link

Human preferences as RL critic values—implications for alignment

Seth Herd, 14 Mar 2023 22:10 UTC
10 points
5 comments, 6 min read, LW link