Jérémy Scheurer

Karma: 1,209

Claude Sonnet 3.7 (often) knows when it's in alignment evaluations

17 Mar 2025 19:11 UTC
181 points
7 comments · 6 min read · LW link

Forecasting Frontier Language Model Agent Capabilities

24 Feb 2025 16:51 UTC
35 points
0 comments · 5 min read · LW link
(www.apolloresearch.ai)

Ablations for "Frontier Models are Capable of In-context Scheming"

17 Dec 2024 23:58 UTC
115 points
1 comment · 2 min read · LW link

Frontier Models are Capable of In-context Scheming

5 Dec 2024 22:11 UTC
203 points
24 comments · 7 min read · LW link

An Opinionated Evals Reading List

15 Oct 2024 14:38 UTC
65 points
0 comments · 13 min read · LW link
(www.apolloresearch.ai)

Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities

22 Jul 2024 16:17 UTC
69 points
0 comments · 16 min read · LW link

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

8 Jul 2024 22:24 UTC
109 points
37 comments · 5 min read · LW link

Apollo Research 1-year update

29 May 2024 17:44 UTC
93 points
0 comments · 7 min read · LW link

We need a Science of Evals

22 Jan 2024 20:30 UTC
71 points
13 comments · 9 min read · LW link

A starter guide for evals

8 Jan 2024 18:24 UTC
53 points
2 comments · 12 min read · LW link
(www.apolloresearch.ai)

Understanding strategic deception and deceptive alignment

25 Sep 2023 16:27 UTC
64 points
16 comments · 7 min read · LW link
(www.apolloresearch.ai)

Announcing Apollo Research

30 May 2023 16:17 UTC
217 points
11 comments · 8 min read · LW link

Imitation Learning from Language Feedback

30 Mar 2023 14:11 UTC
71 points
3 comments · 10 min read · LW link

Practical Pitfalls of Causal Scrubbing

27 Mar 2023 7:47 UTC
87 points
17 comments · 13 min read · LW link