evhub

Karma: 14,746

Evan Hubinger (he/​him/​his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

Building and evaluating alignment auditing agents

24 Jul 2025 19:22 UTC
47 points
1 comment · 5 min read · LW link

Agentic Misalignment: How LLMs Could be Insider Threats

20 Jun 2025 22:34 UTC
78 points
13 comments · 6 min read · LW link

Auditing language models for hidden objectives

13 Mar 2025 19:18 UTC
141 points
15 comments · 13 min read · LW link

Training on Documents About Reward Hacking Induces Reward Hacking

21 Jan 2025 21:32 UTC
131 points
15 comments · 2 min read · LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
489 points
75 comments · 10 min read · LW link

Catastrophic sabotage as a major threat model for human-level AI systems

22 Oct 2024 20:57 UTC
93 points
13 comments · 15 min read · LW link

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
95 points
56 comments · 6 min read · LW link
(assets.anthropic.com)

Automating LLM Auditing with Developmental Interpretability

4 Sep 2024 15:50 UTC
19 points
0 comments · 3 min read · LW link

Sycophancy to subterfuge: Investigating reward tampering in large language models

17 Jun 2024 18:41 UTC
163 points
22 comments · 8 min read · LW link
(arxiv.org)

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
81 points
5 comments · 21 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

6 May 2024 7:07 UTC
95 points
13 comments · 1 min read · LW link
(arxiv.org)

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

Inducing Unprompted Misalignment in LLMs

19 Apr 2024 20:00 UTC
38 points
7 comments · 16 min read · LW link

Measuring Predictability of Persona Evaluations

6 Apr 2024 8:46 UTC
20 points
0 comments · 7 min read · LW link

How to train your own “Sleeper Agents”

7 Feb 2024 0:31 UTC
93 points
11 comments · 2 min read · LW link

Introducing Alignment Stress-Testing at Anthropic

12 Jan 2024 23:51 UTC
182 points
23 comments · 2 min read · LW link

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
305 points
95 comments · 3 min read · LW link
(arxiv.org)

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
125 points
29 comments · 8 min read · LW link
(arxiv.org)

RSPs are pauses done right

14 Oct 2023 4:06 UTC
164 points
73 comments · 7 min read · LW link · 1 review

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

8 Aug 2023 1:30 UTC
322 points
30 comments · 18 min read · LW link · 1 review