RSS

Nicholas Schiefer(Nicholas Schiefer)

Karma: 637

Sleeper Agents: Train­ing De­cep­tive LLMs that Per­sist Through Safety Training

12 Jan 2024 19:51 UTC
288 points
94 comments3 min readLW link
(arxiv.org)

Model Or­ganisms of Misal­ign­ment: The Case for a New Pillar of Align­ment Research

8 Aug 2023 1:30 UTC
305 points
26 comments18 min readLW link

Eng­ineer­ing Monose­man­tic­ity in Toy Models

18 Nov 2022 1:43 UTC
75 points
7 comments3 min readLW link
(arxiv.org)

ELK Pro­posal—Make the Re­porter care about the Pre­dic­tor’s beliefs

11 Jun 2022 22:53 UTC
8 points
0 comments6 min readLW link