evhub

Karma: 15,161

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

Alignment remains a hard, unsolved problem

evhub

27 Nov 2025 8:45 UTC

75 points

2 comments · 13 min read · LW link

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

RowanWang, Sam Marks, Johannes Treutlein, evhub and Fabien Roger

25 Nov 2025 19:33 UTC

32 points

0 comments · 4 min read · LW link

(alignment.anthropic.com)

Natural emergent misalignment from reward hacking in production RL

evhub, Monte M, Benjamin Wright and Jonathan Uesato

21 Nov 2025 20:00 UTC

224 points

30 comments · 9 min read · LW link

Building and evaluating alignment auditing agents

Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein and evhub

24 Jul 2025 19:22 UTC

47 points

1 comment · 5 min read · LW link

Agentic Misalignment: How LLMs Could be Insider Threats

Aengus Lynch, Benjamin Wright, Ethan Perez and evhub

20 Jun 2025 22:34 UTC

82 points

13 comments · 6 min read · LW link

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

13 Mar 2025 19:18 UTC

142 points

15 comments · 13 min read · LW link

Training on Documents About Reward Hacking Induces Reward Hacking

evhub and Nathan Hu

21 Jan 2025 21:32 UTC

131 points

15 comments · 2 min read · LW link

(alignment.anthropic.com)

Alignment Faking in Large Language Models

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck

18 Dec 2024 17:19 UTC

490 points

75 comments · 10 min read · LW link

Catastrophic sabotage as a major threat model for human-level AI systems

evhub

22 Oct 2024 20:57 UTC

93 points

13 comments · 15 min read · LW link

Sabotage Evaluations for Frontier Models

David Duvenaud, Joe Benton, Sam Bowman, evhub, mishajw, Eric Christiansen, HoldenKarnofsky, Ethan Perez and Buck

18 Oct 2024 22:33 UTC

95 points

56 comments · 6 min read · LW link

(assets.anthropic.com)

Automating LLM Auditing with Developmental Interpretability

htlou and evhub

4 Sep 2024 15:50 UTC

19 points

0 comments · 3 min read · LW link

Sycophancy to subterfuge: Investigating reward tampering in large language models

Carson Denison and evhub

17 Jun 2024 18:41 UTC

163 points

22 comments · 8 min read · LW link

(arxiv.org)

Reward hacking behavior can generalize across tasks

Kei Nishimura-Gasparian, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison and Ethan Perez

28 May 2024 16:33 UTC

81 points

5 comments · 21 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Olli Järviniemi and evhub

6 May 2024 7:07 UTC

95 points

13 comments · 1 min read · LW link

(arxiv.org)

Simple probes can catch sleeper agents

Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub

23 Apr 2024 21:10 UTC

133 points

21 comments · 1 min read · LW link

(www.anthropic.com)

Inducing Unprompted Misalignment in LLMs

Sam Svenningsen, evhub and Henry Sleight

19 Apr 2024 20:00 UTC

38 points

7 comments · 16 min read · LW link

Measuring Predictability of Persona Evaluations

Thee Ho and evhub

6 Apr 2024 8:46 UTC

20 points

0 comments · 7 min read · LW link

How to train your own “Sleeper Agents”

evhub

7 Feb 2024 0:31 UTC

93 points

11 comments · 2 min read · LW link

Introducing Alignment Stress-Testing at Anthropic

evhub

12 Jan 2024 23:51 UTC

182 points

23 comments · 2 min read · LW link

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez

12 Jan 2024 19:51 UTC

306 points

95 comments · 3 min read · LW link

(arxiv.org)