Fabien Roger

Karma: 7,794

I am working on empirical AI safety.

Book a call with me if you want advice on a concrete empirical safety project.

Anonymous feedback form.

Measuring and improving coding audit realism with deployment resources

23 Mar 2026 17:20 UTC
42 points
1 comment · 10 min read · LW link
(alignment.anthropic.com)

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

6 Mar 2026 21:54 UTC
43 points
3 comments · 6 min read · LW link

Tools to generate realistic prompts help surprisingly little with Petri audit realism

1 Mar 2026 8:18 UTC
44 points
2 comments · 7 min read · LW link

3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation

27 Feb 2026 17:25 UTC
21 points
0 comments · 10 min read · LW link

Refusals that could become catastrophic

Fabien Roger · 30 Jan 2026 4:12 UTC
72 points
12 comments · 7 min read · LW link

Eliciting base models with simple unsupervised techniques

23 Jan 2026 18:06 UTC
34 points
2 comments · 8 min read · LW link

Should control down-weight negative net-sabotage-value threats?

Fabien Roger · 16 Jan 2026 4:18 UTC
35 points
0 comments · 10 min read · LW link

Towards training-time mitigations for alignment faking in RL

16 Dec 2025 21:01 UTC
33 points
1 comment · 5 min read · LW link
(alignment.anthropic.com)

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

25 Nov 2025 19:33 UTC
40 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Thinking about reasoning models made me less worried about scheming

Fabien Roger · 20 Nov 2025 18:20 UTC
88 points
7 comments · 12 min read · LW link

Steering Language Models with Weight Arithmetic

11 Nov 2025 16:30 UTC
82 points
2 comments · 5 min read · LW link

Rogue internal deployments via external APIs

15 Oct 2025 19:34 UTC
34 points
4 comments · 6 min read · LW link

Current Language Models Struggle to Reason in Ciphered Language

14 Oct 2025 9:08 UTC
78 points
7 comments · 5 min read · LW link

Training Qwen-1.5B with a CoT legibility penalty

Fabien Roger · 9 Oct 2025 21:33 UTC
68 points
7 comments · 4 min read · LW link

Training fails to elicit subtle reasoning in current language models

9 Oct 2025 19:04 UTC
49 points
3 comments · 4 min read · LW link
(alignment.anthropic.com)

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
175 points
37 comments · 2 min read · LW link

Four places where you can put LLM monitoring

9 Aug 2025 23:10 UTC
49 points
0 comments · 7 min read · LW link

Why Do Some Language Models Fake Alignment While Others Don't?

8 Jul 2025 21:49 UTC
158 points
14 comments · 5 min read · LW link
(arxiv.org)

What can be learned from scary demos? A snitching case study

Fabien Roger · 24 Jun 2025 8:40 UTC
28 points
5 comments · 7 min read · LW link

Modifying LLM Beliefs with Synthetic Document Finetuning

24 Apr 2025 21:15 UTC
70 points
12 comments · 2 min read · LW link
(alignment.anthropic.com)