Carson Denison
Karma: 1,504
I work on deceptive alignment and reward hacking at Anthropic.
Auditing language models for hidden objectives
Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub
13 Mar 2025 19:18 UTC · 141 points · 15 comments · 13 min read · LW link
Alignment Faking in Large Language Models
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck
18 Dec 2024 17:19 UTC · 489 points · 75 comments · 10 min read · LW link
Sycophancy to subterfuge: Investigating reward tampering in large language models
Carson Denison and evhub
17 Jun 2024 18:41 UTC · 163 points · 22 comments · 8 min read · LW link (arxiv.org)
Reward hacking behavior can generalize across tasks
Kei, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison and Ethan Perez
28 May 2024 16:33 UTC · 81 points · 5 comments · 21 min read · LW link
Simple probes can catch sleeper agents
Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub
23 Apr 2024 21:10 UTC · 133 points · 21 comments · 1 min read · LW link (www.anthropic.com)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez
12 Jan 2024 19:51 UTC · 305 points · 95 comments · 3 min read · LW link (arxiv.org)
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
evhub, Nicholas Schiefer, Carson Denison and Ethan Perez
8 Aug 2023 1:30 UTC · 320 points · 30 comments · 18 min read · LW link · 1 review
[Question] How do I Optimize Team-Matching at Google
Carson Denison
24 Feb 2022 22:10 UTC · 8 points · 1 comment · 1 min read · LW link