Carson Denison
Karma: 692
I work on deceptive alignment and reward hacking at Anthropic
Simple probes can catch sleeper agents
Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub
23 Apr 2024 21:10 UTC
119 points, 17 comments, 1 min read
LW link (www.anthropic.com)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez
12 Jan 2024 19:51 UTC
298 points, 94 comments, 3 min read
LW link (arxiv.org)
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
evhub, Nicholas Schiefer, Carson Denison and Ethan Perez
8 Aug 2023 1:30 UTC
306 points, 26 comments, 18 min read
LW link
[Question] How do I Optimize Team-Matching at Google
Carson Denison
24 Feb 2022 22:10 UTC
8 points, 1 comment, 1 min read
LW link