Carson Denison

Karma: 748

I work on deceptive alignment and reward hacking at Anthropic.

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
71 points
0 comments · 21 min read · LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
119 points
17 comments · 1 min read · LW link
(www.anthropic.com)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
298 points
95 comments · 3 min read · LW link
(arxiv.org)

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

8 Aug 2023 1:30 UTC
306 points
26 comments · 18 min read · LW link

[Question] How do I Optimize Team-Matching at Google

24 Feb 2022 22:10 UTC
8 points
1 comment · 1 min read · LW link