
Ethan Perez

Karma: 2,909

I’m a research scientist at Anthropic doing empirical safety research on language models. In the past, I’ve worked on automated red teaming of language models [1], the inverse scaling prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable.

Website: https://ethanperez.net/

Imitation Learning from Language Feedback

Mar 30, 2023, 2:11 PM
71 points
3 comments · 10 min read · LW link

Pretraining Language Models with Human Preferences

Feb 21, 2023, 5:57 PM
135 points
20 comments · 11 min read · LW link · 2 reviews

Inverse Scaling Prize: Second Round Winners

Jan 24, 2023, 8:12 PM
58 points
17 comments · 15 min read · LW link

Discovering Language Model Behaviors with Model-Written Evaluations

Dec 20, 2022, 8:08 PM
100 points
34 comments · 1 min read · LW link
(www.anthropic.com)

Inverse Scaling Prize: Round 1 Winners

Sep 26, 2022, 7:57 PM
93 points
16 comments · 4 min read · LW link
(irmckenzie.co.uk)

We may be able to see sharp left turns coming

Sep 3, 2022, 2:55 AM
54 points
29 comments · 1 min read · LW link

A Test for Language Model Consciousness

Ethan Perez · Aug 25, 2022, 7:41 PM
18 points
14 comments · 9 min read · LW link

Introducing the Fund for Alignment Research (We’re Hiring!)

Jul 6, 2022, 2:07 AM
62 points
0 comments · 4 min read · LW link

Announcing the Inverse Scaling Prize ($250k Prize Pool)

Jun 27, 2022, 3:58 PM
171 points
14 comments · 7 min read · LW link

RL with KL penalties is better seen as Bayesian inference

May 25, 2022, 9:23 AM
115 points
17 comments · 12 min read · LW link

Language Model Alignment Research Internships

Ethan Perez · Dec 13, 2021, 7:53 PM
74 points
1 comment · 1 min read · LW link