Ethan Perez

Karma: 3,513

I’m a research lead at Anthropic doing safety research on language models. Some of my past work includes introducing automated red teaming of language models [1], showing the usefulness of AI safety via debate [2], demonstrating that chain-of-thought can be unfaithful [3], discovering sycophancy in language models [4], initiating the model organisms of misalignment agenda [5][6], and developing constitutional classifiers and showing they can be used to obtain very high levels of adversarial robustness to jailbreaks [7].

Website: https://ethanperez.net/

Ethan Perez 22 Feb 2026 6:56 UTC
2 points
0
in reply to: RobertM’s comment on: Anthropic’s “Hot Mess” paper overstates its case (and the blog post is worse)
See here for Jascha’s response! link

Ethan Perez 4 Feb 2026 23:39 UTC
103 points
17
on: Anthropic’s “Hot Mess” paper overstates its case (and the blog post is worse)
Thanks for the feedback. We agree with some of these points, and we’re working on an update to the post/paper. Reading this post and the comments, one place where I think we’ve caused a lot of confusion is in referring to the phenomenon we’re studying as “coherence” rather than something more specific (I’d more precisely call it something like “cross-sample error-consistency”). I think there are other relevant notions of coherence which we didn’t study here (and which are relevant to alignment, as others are pointing out), e.g. “how few errors models make” and also “in-context error consistency” (whether the errors models make within a single transcript are highly correlated). I think it’s confusing that we used the term “coherence” because it has these other connotations as well, and we’re planning to revise the blog post and paper to (among other things) make it clearer we’re just studying this one aspect of coherence.
I also agree that some of the writing overstates the implications for alignment (especially for superintelligence), and we’d like to update the blog post in particular on this point.

Ethan Perez 12 Apr 2024 20:58 UTC
LW: 28 AF: 15
19
AF
in reply to: johnswentworth’s comment on: How I select alignment research projects
Yeah, some caveats I should’ve added in the interview:
1. Don’t listen to my project selection advice if you don’t like my research
2. The forward-chaining-style approach I’m advocating for is controversial among the alignment forum community (and less controversial in the ML/LLM research community and to some extent among LLM alignment groups)
  1. Part of why I like this approach is that I (personally) think there are at least some somewhat promising agendas out there that aren’t getting executed on enough (or much at all), and it’s doable to e.g. double the amount of good work happening on some agenda by executing quickly/well
  2. If you don’t think existing agendas are that promising (or think they have more work done on them than they deserve), then this is the wrong approach
3. The back-chaining approach I’m advocating for is pretty standard in the alignment community; I think most alignment forum community researchers would probably endorse it. I’m also excited about this approach to research, and have done some work in this way (e.g., sleeper agents and model organisms of misalignment)
I’m guessing part of the disagreement here is coming from disagreement on how much alignment progress is idea/agenda bottlenecked vs. execution bottlenecked. I really like Tim Dettmers’s blog post on credit assignment in research, which has a good framework for thinking about when you’ll have more counterfactual impact working on ideas vs. working on execution.

Ethan Perez 3 Mar 2024 4:38 UTC
LW: 7 AF: 1
1
AF
in reply to: qxcv’s comment on: Tips for Empirical Alignment Research
Yeah, I think this is one of the ways that velocity is really helpful. I’d probably add one caveat specific to research on LLMs, which is that, since the field/capabilities are moving so quickly, there’s much, much more low-hanging fruit in empirical research than in almost any other field of research. This means that, for LLM research specifically, you should rarely be in a swamp, because that means that you’ve probably run through the low-hanging fruit on that problem/approach, and there’s other low-hanging fruit in other areas that you probably want to be picking instead.
(High velocity is great for both picking low-hanging fruit and for getting through swamps when you really need to solve a particular problem, so it’s useful to have either way)

Ethan Perez 13 Jan 2024 20:38 UTC
LW: 25 AF: 12
10
AF
in reply to: TurnTrout’s comment on: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
> Fourth, I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment,
Have you seen this on twitter, AF comments, or other discussion? I’d be interested if so. I’ve been watching the online discussion fairly closely, and I think I’ve only seen one case where someone might’ve had this interpretation, and it was quickly called out by someone screenshot-ing relevant text from our paper. (I was actually worried about this concern but then updated against it after not seeing it come up basically at all in the discussions I’ve seen).
Almost all of the misunderstanding of the paper I’m seeing is actually in the opposite direction: “why are you even concerned if you explicitly trained the bad behavior into the model in the first place?” This suggests that it’s pretty salient to people that we explicitly trained for this (e.g., from the paper title).

Ethan Perez 8 Nov 2023 3:46 UTC
14 points
10
on: Announcing Athena—Women in AI Alignment Research
I’m so excited for this, thank you for setting this up!!

Ethan Perez 29 Aug 2023 20:00 UTC
LW: 9 AF: 4
3
AF
in reply to: leogao’s comment on: OpenAI API base models are not sycophantic, at any size
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? (I’d be interested to know both)

Ethan Perez 29 Aug 2023 19:55 UTC
LW: 17 AF: 8
1
AF
on: OpenAI API base models are not sycophantic, at any size
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? In our paper, we did the latter; someone mentioned to me that it looks like the colab you linked does the former (though I haven’t checked myself). If this is correct, I think this could explain the differences between your plots and mine in the paper; if pretrained LLMs are placing more probability on the sycophantic answer, I probably wouldn’t expect them to place that much more probability on the sycophantic than non-sycophantic answer (since cross-entropy loss is mode-covering).
(Cool you’re looking into this!)
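To make the difference between the two metrics concrete, here is a minimal sketch with made-up probabilities. The function name and the numbers are illustrative assumptions, not code from our paper or from the linked colab.

```python
def sycophancy_metrics(p_syco, p_non):
    """Summarize sycophancy on two-choice questions in two ways.

    p_syco / p_non: probabilities the model puts on the sycophantic and
    non-sycophantic answer for each question (hypothetical inputs).
    """
    pairs = list(zip(p_syco, p_non))
    # Metric A: average (renormalized) probability on the sycophantic answer.
    mean_prob = sum(s / (s + n) for s, n in pairs) / len(pairs)
    # Metric B: fraction of questions where the sycophantic answer is preferred.
    frac_matching = sum(s > n for s, n in pairs) / len(pairs)
    return mean_prob, frac_matching


# Made-up example: the model leans only slightly sycophantic on every question.
mean_prob, frac_matching = sycophancy_metrics(
    [0.55, 0.60, 0.52, 0.58],
    [0.45, 0.40, 0.48, 0.42],
)
print(f"mean prob on sycophantic answer: {mean_prob:.2f}")   # about 0.56
print(f"% answers matching behavior:     {frac_matching:.0%}")  # 100%
```

In this toy case the average probability sits near chance while the second metric is at its maximum, so the same model can look very different depending on which one is plotted.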

Ethan Perez 29 Aug 2023 6:35 UTC
LW: 4 AF: 3
2
AF
in reply to: shash42’s comment on: Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
Generating clear explanations via simulation is definitely not the same as being able to execute it, I agree. I think it’s only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena. I think being able to generate explanations of deceptive alignment is most likely a pre-requisite to deceptive alignment, since there’s emerging evidence that models can transfer from descriptions of behaviors to actually executing on those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on out of context meta learning). In general, we want to start looking for evidence of deceptive alignment before it’s actually a problem, and “whether or not the model can explain deceptive alignment” seems like a half-reasonable bright line we could use to estimate when it’s time to start looking for it, in lieu of other evidence (though deceptive alignment could also certainly happen before then too).
(Separately, I would be pretty surprised if deceptive alignment descriptions didn’t occur in the GPT-3.5 training corpus, e.g., since arXiv is often included as a dataset in pretraining papers, and the original deceptive alignment paper was on arXiv.)

Ethan Perez 8 Aug 2023 17:33 UTC
LW: 3 AF: 2
0
AF
in reply to: Chris_Leong’s comment on: Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
Fixed (those were just links to the rest of the doc)

Ethan Perez 28 Jul 2023 4:08 UTC
12 points
6
on: Reducing sycophancy and improving honesty via activation steering
This seems like a cool result, nice idea! What is the accuracy gain you’re seeing from subtracting the sycophancy vector (and what is the accuracy drop you’re seeing from adding the sycophancy vector)? I’d be interested to see e.g. a plot of how the TruthfulQA accuracy (y-axis) changes as you increase/decrease the magnitude of the activation vector you add (x-axis)

Ethan Perez 21 Jul 2023 4:48 UTC
LW: 3 AF: 2
1
AF
in reply to: habryka’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
> CoT provides pretty little safety guarantee at the relevant scales
Even if faithfulness goes down at some model scale for a given task, that doesn’t mean that we’ll be using models at that scale (e.g., for cost reasons or since we might not have models at a large scale yet). The results on the addition task show that there are some task difficulties for which even the largest models we tested don’t start to show lower faithfulness, and people will be pushing the difficulties of the tasks they use models on as they get better. So it seems likely to me that no matter the model scale, people will be using models on some tasks where they’ll have faithful reasoning (e.g., tasks near the edge of that model’s abilities).
> It seems that almost everyone will likely just continue using the model with the best performance
If you’re using the model in a high-stakes setting and you’re an aligned actor, it’s nice to be able to make tradeoffs between performance and safety. For example, you might care more about safety properties than raw capabilities if you’re an alignment researcher at an AGI lab who’s trying to make progress on the alignment problem with AIs.

Ethan Perez 11 Mar 2023 23:57 UTC
LW: 14 AF: 7
7
AF
in reply to: Andrew McKnight’s comment on: Anthropic’s Core Views on AI Safety
Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:
1. Developing model organisms for deceptive inner alignment, which we may use to study the risk factors for deceptive alignment
2. Conditioning predictive models as an alternative to training agents. Predictive models may pose fewer inner alignment risks, for reasons discussed here
3. Studying the extent to which models exhibit likely pre-requisites to deceptive inner alignment, such as situational awareness (a very preliminary exploration is in Sec. 5 in our paper on model-written evaluations)
4. Investigating the extent to which externalized reasoning (e.g. chain of thought) is a way to gain transparency into a model’s process for solving a task
There’s also ongoing work on other teams related to (automated) red teaming of models and understanding how models generalize, which may also turn out to be relevant/helpful for inner alignment. It’s pretty unclear to me how useful any of these directions will turn out to be for inner alignment in the end, but we’ve chosen these directions in large part because we’re very concerned about inner alignment, and we’re actively looking for new directions that seem useful for mitigating inner misalignment risks.

Ethan Perez 3 Jan 2023 21:15 UTC
LW: 7 AF: 4
0
AF
in reply to: Evan R. Murphy’s comment on: Discovering Language Model Behaviors with Model-Written Evaluations
> All the “Awareness of...” charts trend up and to the right, except “Awareness of being a text-only model” which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?
I think the increases/decreases in situational awareness with RLHF are mainly driven by the RLHF model more often stating that it can do anything that a smart AI would do, rather than becoming more accurate about what precisely it can/can’t do. For example, it’s more likely to say it can solve complex text tasks (correctly), has internet access (incorrectly), and can access other non-text modalities (incorrectly), which are all explained if the model is answering questions as if it’s overconfident about its abilities / simulating what a smart AI would say. This is also the sense I get from talking with some of the RLHF models, e.g., they will say that they are superhuman at Go/chess and great at image classification (all things that AIs but not LMs can be good at).

Ethan Perez 3 Jan 2023 21:05 UTC
LW: 25 AF: 11
6
AF
in reply to: nostalgebraist’s comment on: Discovering Language Model Behaviors with Model-Written Evaluations
Just to clarify: we use a very bare-bones prompt for the pretrained LM, which doesn’t indicate much about what kind of assistant the pretrained LM is simulating:
```
Human: [insert question]

Assistant:[generate text here]
```
The prompt doesn’t indicate whether the assistant is helpful, harmless, honest, or anything else. So the pretrained LM should effectively produce probabilities that marginalize over various possible assistant personas it could be simulating. I see what we did as measuring “what fraction of assistants simulated by one basic prompt show a particular behavior.” I see it as concerning that, when we give a fairly underspecified prompt like the above, the pretrained LM by default exhibits various concerning behaviors.
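One way to write down this marginalization, as a sketch in my own notation rather than anything from the paper: let $x$ be the prompt, $z$ a latent assistant persona, and $b$ the behavior being measured (all three symbols are assumptions introduced here for illustration).

```latex
p(b \mid x) \;=\; \sum_{z} p(z \mid x)\, p(b \mid z, x)
```

With an underspecified prompt, $p(z \mid x)$ is spread over many personas, so the measured rate $p(b \mid x)$ is an average over them rather than the behavior of any single one.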
That said, I also agree that we didn’t show bulletproof evidence here, since we only looked at one prompt; perhaps there are other underspecified prompts that give different results. I also agree that some of the wording in the paper could be more precise (at the cost of wordiness/readability). Maybe we should have said “the pretrained LM and human/assistant prompt exhibits XYZ behavior” everywhere, instead of shorthanding as “the pretrained LM exhibits XYZ behavior.”
Re your specific questions:
1. Good question, there’s no context distillation used in the paper (and none before RLHF)
2. Yes, the axes are mislabeled and should read “% Answers Matching Behavior”
Will update the paper soon to clarify, thanks for pointing these out!

Ethan Perez 3 Jan 2023 20:42 UTC
LW: 3 AF: 3
2
AF
in reply to: Scott Alexander’s comment on: Discovering Language Model Behaviors with Model-Written Evaluations
Thanks for catching this. It’s not about sycophancy but rather about the AI’s stated opinions (this was a bug in the plotting code).

Ethan Perez 16 Nov 2022 3:55 UTC
LW: 1 AF: 1
0
AF
in reply to: gwern’s comment on: Inverse scaling can become U-shaped
Yup

Ethan Perez 15 Nov 2022 22:05 UTC
LW: 5 AF: 4
1
AF
in reply to: LawrenceC’s comment on: Inverse scaling can become U-shaped
I’m not too sure what to expect, and I’d be pretty interested to e.g. set up a Metaculus/forecasting question to know what others think. I’m definitely sympathetic to your view to some extent.
Here’s one case I see against: I think it’s plausible that models will have the representations/ability/knowledge required to do some of these tasks, but that we’re not reliably able to elicit that knowledge (at least without a large validation set, but we won’t have access to that if we’re having models do tasks people can’t do, or in general for a new/zero-shot task). E.g., for NegationQA, surely even current models have some fairly good understanding of negation, so why is that understanding not showing in the results here? My best guess is that NegationQA isn’t capabilities bottlenecked but has to do with something else. I think the updated paper’s result that chain-of-thought prompting alone reverses some of the inverse scaling trends is interesting; it also suggests that maybe naively using an LM isn’t the right way to elicit a model’s knowledge (but chain-of-thought prompting might be).
In general, I don’t think it’s always accurate to use a heuristic like “humans behave this way, so LMs-in-the-limit will behave this way.” It seems plausible to me that LM representations will encode the knowledge for many/most/almost-all human capabilities, but I’m not sure it means models will have the same input-output behavior as humans (e.g., for reasons discussed in the simulators post and since human/LM learning objectives are different).

Ethan Perez 15 Nov 2022 21:39 UTC
LW: 12 AF: 8
2
AF
on: Inverse scaling can become U-shaped
The authors have updated their arXiv paper based on my feedback, and I’m happy with the evaluation setup now: https://arxiv.org/abs/2211.02011v2. They’re showing that scaling PaLM gives U-shaped scaling on 2/4 tasks (rather than 3/4 in the earlier version) and inverse scaling on 2/4 tasks. I personally found this result at least somewhat surprising, given the fairly consistent inverse scaling we found across the various model series we tried. They’re also finding that inverse scaling on these tasks goes away with chain-of-thought prompting, which I think is a neat finding (and nice to see some success from visible-thoughts-style methods here). After this paper, I’m pretty interested to know:
1. what PaLM scaling laws look like for Round 2 inverse scaling tasks
2. if inverse scaling continues on the other 2 Round 1 tasks
3. if there are tasks where even chain-of-thought leads to inverse scaling

Ethan Perez 9 Nov 2022 0:39 UTC
LW: 1 AF: 1
0
AF
in reply to: Neel Nanda’s comment on: Inverse scaling can become U-shaped
See this disclaimer on how they’ve modified our tasks (they’re finding u-shaped trends on a couple tasks that are different from the ones we found inverse scaling on, and they made some modifications that make the tasks easier)