Meg

Karma: 579

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
288 points
94 comments · 3 min read · LW link
(arxiv.org)

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
119 points
29 comments · 8 min read · LW link
(arxiv.org)

Towards Understanding Sycophancy in Language Models

24 Oct 2023 0:30 UTC
66 points
0 comments · 2 min read · LW link
(arxiv.org)

Paper: LLMs trained on “A is B” fail to learn “B is A”

23 Sep 2023 19:55 UTC
120 points
73 comments · 4 min read · LW link
(arxiv.org)

Paper: On measuring situational awareness in LLMs

4 Sep 2023 12:54 UTC
106 points
16 comments · 5 min read · LW link
(arxiv.org)