Fabien Roger

Karma: 6,238

I am working on empirical AI safety.

Book a call with me if you want advice on a concrete empirical safety project.

Anonymous feedback form.

Training Qwen-1.5B with a CoT legibility penalty

Fabien Roger · 9 Oct 2025 21:33 UTC
64 points
5 comments · 4 min read · LW link

Training fails to elicit subtle reasoning in current language models

9 Oct 2025 19:04 UTC
48 points
2 comments · 4 min read · LW link
(alignment.anthropic.com)

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
141 points
26 comments · 2 min read · LW link

Four places where you can put LLM monitoring

9 Aug 2025 23:10 UTC
48 points
0 comments · 7 min read · LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

8 Jul 2025 21:49 UTC
158 points
14 comments · 5 min read · LW link
(arxiv.org)

What can be learned from scary demos? A snitching case study

Fabien Roger · 24 Jun 2025 8:40 UTC
22 points
1 comment · 7 min read · LW link

Modifying LLM Beliefs with Synthetic Document Finetuning

24 Apr 2025 21:15 UTC
70 points
12 comments · 2 min read · LW link
(alignment.anthropic.com)

Reasoning models don’t always say what they think

9 Apr 2025 19:48 UTC
28 points
4 comments · 1 min read · LW link
(www.anthropic.com)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

8 Apr 2025 17:32 UTC
146 points
20 comments · 12 min read · LW link

Automated Researchers Can Subtly Sandbag

26 Mar 2025 19:13 UTC
44 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Auditing language models for hidden objectives

13 Mar 2025 19:18 UTC
141 points
15 comments · 13 min read · LW link

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

Fabien Roger · 11 Mar 2025 11:52 UTC
127 points
23 comments · 11 min read · LW link
(alignment.anthropic.com)

Fuzzing LLMs sometimes makes them reveal their secrets

Fabien Roger · 26 Feb 2025 16:48 UTC
64 points
13 comments · 9 min read · LW link

How to replicate and extend our alignment faking demo

Fabien Roger · 19 Dec 2024 21:44 UTC
114 points
5 comments · 2 min read · LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
489 points
75 comments · 10 min read · LW link

A toy evaluation of inference code tampering

Fabien Roger · 9 Dec 2024 17:43 UTC
52 points
0 comments · 9 min read · LW link
(alignment.anthropic.com)

The case for unlearning that removes information from LLM weights

Fabien Roger · 14 Oct 2024 14:08 UTC
102 points
18 comments · 6 min read · LW link

[Question] Is cybercrime really costing trillions per year?

Fabien Roger · 27 Sep 2024 8:44 UTC
63 points
28 comments · 1 min read · LW link

An issue with training schemers with supervised fine-tuning

Fabien Roger · 27 Jun 2024 15:37 UTC
49 points
12 comments · 6 min read · LW link

Best-of-n with misaligned reward models for Math reasoning

Fabien Roger · 21 Jun 2024 22:53 UTC
25 points
0 comments · 3 min read · LW link