Ethan Perez

Karma: 806

I’m a research scientist at Anthropic doing empirical AI safety research on language models. In the past, I’ve worked on automated red teaming of language models [1], the inverse scaling prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable than us.

Website: http://

Inverse Scaling Prize: Second Round Winners

24 Jan 2023 20:12 UTC
47 points
13 comments · 15 min read · LW link

Discovering Language Model Behaviors with Model-Written Evaluations

20 Dec 2022 20:08 UTC
72 points
28 comments · 1 min read · LW link

Inverse Scaling Prize: Round 1 Winners

26 Sep 2022 19:57 UTC
88 points
16 comments · 4 min read · LW link