Ethan Perez

Karma: 3,324

I’m a research lead at Anthropic doing safety research on language models. Some of my past work includes introducing automated red teaming of language models [1], showing the usefulness of AI safety via debate [2], demonstrating that chain-of-thought can be unfaithful [3], discovering sycophancy in language models [4], initiating the model organisms of misalignment agenda [5][6], and developing constitutional classifiers and showing they can be used to obtain very high levels of adversarial robustness to jailbreaks [7].

Website: https://ethanperez.net/

Reasons to sell frontier lab equity to donate now rather than later

26 Sep 2025 23:07 UTC
244 points
33 comments · 12 min read · LW link

Inverse Scaling in Test-Time Compute

22 Jul 2025 22:06 UTC
20 points
2 comments · 2 min read · LW link
(arxiv.org)

Agentic Misalignment: How LLMs Could be Insider Threats

20 Jun 2025 22:34 UTC
82 points
13 comments · 6 min read · LW link

Unsupervised Elicitation of Language Models

13 Jun 2025 16:15 UTC
57 points
11 comments · 2 min read · LW link

Modifying LLM Beliefs with Synthetic Document Finetuning

24 Apr 2025 21:15 UTC
70 points
12 comments · 2 min read · LW link
(alignment.anthropic.com)

Reasoning models don’t always say what they think

9 Apr 2025 19:48 UTC
28 points
4 comments · 1 min read · LW link
(www.anthropic.com)

Automated Researchers Can Subtly Sandbag

26 Mar 2025 19:13 UTC
44 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Tips and Code for Empirical Research Workflows

20 Jan 2025 22:31 UTC
96 points
15 comments · 20 min read · LW link

Tips On Empirical Research Slides

8 Jan 2025 5:06 UTC
96 points
4 comments · 6 min read · LW link

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

16 Dec 2024 22:42 UTC
50 points
1 comment · 2 min read · LW link
(arxiv.org)

Best-of-N Jailbreaking

14 Dec 2024 4:58 UTC
78 points
5 comments · 2 min read · LW link
(arxiv.org)

Introducing the Anthropic Fellows Program

30 Nov 2024 23:47 UTC
26 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
95 points
56 comments · 6 min read · LW link
(assets.anthropic.com)

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
81 points
5 comments · 21 min read · LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

How I select alignment research projects

10 Apr 2024 4:33 UTC
36 points
4 comments · 24 min read · LW link

Tips for Empirical Alignment Research

Ethan Perez · 29 Feb 2024 6:04 UTC
181 points
5 comments · 23 min read · LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
89 points
14 comments · 9 min read · LW link
(arxiv.org)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
306 points
95 comments · 3 min read · LW link
(arxiv.org)

Towards Evaluating AI Systems for Moral Status Using Self-Reports

16 Nov 2023 20:18 UTC
45 points
3 comments · 1 min read · LW link
(arxiv.org)