Ethan Perez

Karma: 3,324

I’m a research lead at Anthropic doing safety research on language models. Some of my past work includes introducing automated red teaming of language models [1], showing the usefulness of AI safety via debate [2], demonstrating that chain-of-thought can be unfaithful [3], discovering sycophancy in language models [4], initiating the model organisms of misalignment agenda [5][6], and developing constitutional classifiers and showing they can be used to obtain very high levels of adversarial robustness to jailbreaks [7].

Website: https://ethanperez.net/

Reasons to sell frontier lab equity to donate now rather than later

26 Sep 2025 23:07 UTC
244 points
33 comments · 12 min read · LW link

Inverse Scaling in Test-Time Compute

22 Jul 2025 22:06 UTC
20 points
2 comments · 2 min read · LW link
(arxiv.org)

Agentic Misalignment: How LLMs Could be Insider Threats

20 Jun 2025 22:34 UTC
82 points
13 comments · 6 min read · LW link

Unsupervised Elicitation of Language Models

13 Jun 2025 16:15 UTC
57 points
11 comments · 2 min read · LW link

Modifying LLM Beliefs with Synthetic Document Finetuning

24 Apr 2025 21:15 UTC
70 points
12 comments · 2 min read · LW link
(alignment.anthropic.com)

Reasoning models don’t always say what they think

9 Apr 2025 19:48 UTC
28 points
4 comments · 1 min read · LW link
(www.anthropic.com)

Automated Researchers Can Subtly Sandbag

26 Mar 2025 19:13 UTC
44 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Tips and Code for Empirical Research Workflows

20 Jan 2025 22:31 UTC
96 points
15 comments · 20 min read · LW link

Tips On Empirical Research Slides

8 Jan 2025 5:06 UTC
96 points
4 comments · 6 min read · LW link

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

16 Dec 2024 22:42 UTC
50 points
1 comment · 2 min read · LW link
(arxiv.org)

Best-of-N Jailbreaking

14 Dec 2024 4:58 UTC
78 points
5 comments · 2 min read · LW link
(arxiv.org)

Introducing the Anthropic Fellows Program

30 Nov 2024 23:47 UTC
26 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
95 points
56 comments · 6 min read · LW link
(assets.anthropic.com)

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
81 points
5 comments · 21 min read · LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

How I select alignment research projects

10 Apr 2024 4:33 UTC
36 points
4 comments · 24 min read · LW link

Tips for Empirical Alignment Research

Ethan Perez · 29 Feb 2024 6:04 UTC
181 points
5 comments · 23 min read · LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
89 points
14 comments · 9 min read · LW link
(arxiv.org)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
306 points
95 comments · 3 min read · LW link
(arxiv.org)

Towards Evaluating AI Systems for Moral Status Using Self-Reports

16 Nov 2023 20:18 UTC
45 points
3 comments · 1 min read · LW link
(arxiv.org)