Carson Denison

Karma: 938

I work on deceptive alignment and reward hacking at Anthropic.

Sycophancy to subterfuge: Investigating reward tampering in large language models

17 Jun 2024 18:41 UTC
161 points
22 comments
8 min read
LW link
(arxiv.org)

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
74 points
5 comments
21 min read
LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
127 points
18 comments
1 min read
LW link
(www.anthropic.com)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
299 points
95 comments
3 min read
LW link
(arxiv.org)

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

8 Aug 2023 1:30 UTC
308 points
26 comments
18 min read
LW link

[Question] How do I Optimize Team-Matching at Google

Carson Denison
24 Feb 2022 22:10 UTC
8 points
1 comment
1 min read
LW link