Carson Denison
Karma: 1,504
I work on deceptive alignment and reward hacking at Anthropic.
Auditing language models for hidden objectives
Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub
13 Mar 2025 19:18 UTC · 141 points · 15 comments · 13 min read · LW link
Alignment Faking in Large Language Models
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck
18 Dec 2024 17:19 UTC · 489 points · 75 comments · 10 min read · LW link
Sycophancy to subterfuge: Investigating reward tampering in large language models
Carson Denison and evhub
17 Jun 2024 18:41 UTC · 163 points · 22 comments · 8 min read · LW link (arxiv.org)
Reward hacking behavior can generalize across tasks
Kei, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison and Ethan Perez
28 May 2024 16:33 UTC · 81 points · 5 comments · 21 min read · LW link
Simple probes can catch sleeper agents
Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub
23 Apr 2024 21:10 UTC · 133 points · 21 comments · 1 min read · LW link (www.anthropic.com)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez
12 Jan 2024 19:51 UTC · 305 points · 95 comments · 3 min read · LW link (arxiv.org)
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
evhub, Nicholas Schiefer, Carson Denison and Ethan Perez
8 Aug 2023 1:30 UTC · 320 points · 30 comments · 18 min read · LW link · 1 review
[Question] How do I Optimize Team-Matching at Google
Carson Denison
24 Feb 2022 22:10 UTC · 8 points · 1 comment · 1 min read · LW link