RSS

AlexMeinke

Karma: 639

Stress Test­ing De­liber­a­tive Align­ment for Anti-Schem­ing Training

17 Sep 2025 16:59 UTC
106 points
2 comments1 min readLW link
(antischeming.ai)

Abla­tions for “Fron­tier Models are Ca­pable of In-con­text Schem­ing”

17 Dec 2024 23:58 UTC
115 points
1 comment2 min readLW link

Fron­tier Models are Ca­pable of In-con­text Scheming

5 Dec 2024 22:11 UTC
210 points
24 comments7 min readLW link

Train­ing AI agents to solve hard prob­lems could lead to Scheming

19 Nov 2024 0:10 UTC
61 points
12 comments28 min readLW link

Me, My­self, and AI: the Si­tu­a­tional Aware­ness Dataset (SAD) for LLMs

8 Jul 2024 22:24 UTC
109 points
37 comments5 min readLW link

Apollo Re­search 1-year update

29 May 2024 17:44 UTC
93 points
0 comments7 min readLW link

A starter guide for evals

8 Jan 2024 18:24 UTC
55 points
2 comments12 min readLW link
(www.apolloresearch.ai)

Paper: Tell, Don’t Show- Declar­a­tive facts in­fluence how LLMs generalize

19 Dec 2023 19:14 UTC
45 points
4 comments6 min readLW link
(arxiv.org)