Fabien Roger

Karma: 8,393

I am working on empirical AI safety.

Anonymous feedback form.

(Mis)generalization of Helpful-Only Fine-tuning

4 Jun 2026 18:40 UTC
55 points
7 comments, 11 min read, LW link

Classifier Context Rot: Monitor Performance Degrades with Context Length

18 May 2026 14:05 UTC
54 points
1 comment, 4 min read, LW link

How useful is cross-domain generalization for training LLM monitors?

18 May 2026 13:52 UTC
21 points
0 comments, 4 min read, LW link

A Research Agenda for Secret Loyalties

13 May 2026 17:34 UTC
35 points
3 comments, 3 min read, LW link

Measuring the ability of Opus 4.5 to fool narrow classifiers

2 May 2026 22:43 UTC
31 points
0 comments, 8 min read, LW link

Poisoning Fine-tuning Datasets of Constitutional Classifiers

29 Apr 2026 17:04 UTC
28 points
2 comments, 11 min read, LW link
(alignment.anthropic.com)

Control protocols don’t always need to know which models are scheming

Fabien Roger, 26 Apr 2026 19:16 UTC
39 points
1 comment, 6 min read, LW link

Narrow Secret Loyalty Dodges Black-Box Audits

22 Apr 2026 9:41 UTC
49 points
1 comment, 13 min read, LW link
(arxiv.org)

How Unmonitored External Agents can Sabotage AI labs

9 Apr 2026 18:07 UTC
23 points
0 comments, 9 min read, LW link

Measuring and improving coding audit realism with deployment resources

23 Mar 2026 17:20 UTC
43 points
1 comment, 10 min read, LW link
(alignment.anthropic.com)

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

6 Mar 2026 21:54 UTC
44 points
5 comments, 6 min read, LW link

Tools to generate realistic prompts help surprisingly little with Petri audit realism

1 Mar 2026 8:18 UTC
44 points
2 comments, 7 min read, LW link

3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation

27 Feb 2026 17:25 UTC
27 points
0 comments, 10 min read, LW link

Refusals that could become catastrophic

Fabien Roger, 30 Jan 2026 4:12 UTC
84 points
12 comments, 7 min read, LW link

Eliciting base models with simple unsupervised techniques

23 Jan 2026 18:06 UTC
34 points
2 comments, 8 min read, LW link

Should control down-weight negative net-sabotage-value threats?

Fabien Roger, 16 Jan 2026 4:18 UTC
35 points
0 comments, 10 min read, LW link

Towards training-time mitigations for alignment faking in RL

16 Dec 2025 21:01 UTC
39 points
1 comment, 5 min read, LW link
(alignment.anthropic.com)

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

25 Nov 2025 19:33 UTC
41 points
0 comments, 4 min read, LW link
(alignment.anthropic.com)

Thinking about reasoning models made me less worried about scheming

Fabien Roger, 20 Nov 2025 18:20 UTC
89 points
7 comments, 12 min read, LW link

Steering Language Models with Weight Arithmetic

11 Nov 2025 16:30 UTC
88 points
6 comments, 5 min read, LW link