
Ethan Perez

Karma: 2,300

I’m a research scientist at Anthropic doing empirical safety research on language models. In the past, I’ve worked on automated red teaming of language models [1], the inverse scaling prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable.

Website: https://ethanperez.net/

Announcing the Inverse Scaling Prize ($250k Prize Pool)

27 Jun 2022 15:58 UTC
169 points
14 comments · 7 min read · LW link

Tips for Empirical Alignment Research

Ethan Perez · 29 Feb 2024 6:04 UTC
142 points
4 comments · 22 min read · LW link

Inverse Scaling Prize: Round 1 Winners

26 Sep 2022 19:57 UTC
93 points
16 comments · 4 min read · LW link
(irmckenzie.co.uk)

Language Model Alignment Research Internships

Ethan Perez · 13 Dec 2021 19:53 UTC
74 points
1 comment · 1 min read · LW link

Towards Understanding Sycophancy in Language Models

24 Oct 2023 0:30 UTC
66 points
0 comments · 2 min read · LW link
(arxiv.org)

We may be able to see sharp left turns coming

3 Sep 2022 2:55 UTC
53 points
29 comments · 2 min read · LW link

Towards Evaluating AI Systems for Moral Status Using Self-Reports

16 Nov 2023 20:18 UTC
45 points
3 comments · 1 min read · LW link
(arxiv.org)

How I select alignment research projects

10 Apr 2024 4:33 UTC
34 points
4 comments · 24 min read · LW link

A Test for Language Model Consciousness

Ethan Perez · 25 Aug 2022 19:41 UTC
18 points
14 comments · 9 min read · LW link