Carson Denison
Karma: 748
I work on deceptive alignment and reward hacking at Anthropic
Reward hacking behavior can generalize across tasks
Kei Nishimura-Gasparian, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison and Ethan Perez
28 May 2024 16:33 UTC, 71 points, 0 comments, 21 min read
Simple probes can catch sleeper agents (www.anthropic.com)
Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub
23 Apr 2024 21:10 UTC, 119 points, 17 comments, 1 min read
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arxiv.org)
evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez
12 Jan 2024 19:51 UTC, 298 points, 95 comments, 3 min read
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
evhub, Nicholas Schiefer, Carson Denison and Ethan Perez
8 Aug 2023 1:30 UTC, 306 points, 26 comments, 18 min read
[Question] How do I Optimize Team-Matching at Google
Carson Denison
24 Feb 2022 22:10 UTC, 8 points, 1 comment, 1 min read