Fabien Roger

Karma: 8,079

I am working on empirical AI safety.

Anonymous feedback form.

Measuring the ability of Opus 4.5 to fool narrow classifiers

2 May 2026 22:43 UTC
30 points
0 comments · 8 min read · LW link

Poisoning Fine-tuning Datasets of Constitutional Classifiers

29 Apr 2026 17:04 UTC
28 points
2 comments · 11 min read · LW link
(alignment.anthropic.com)

Control protocols don’t always need to know which models are scheming

26 Apr 2026 19:16 UTC
39 points
1 comment · 6 min read · LW link

Narrow Secret Loyalty Dodges Black-Box Audits

22 Apr 2026 9:41 UTC
48 points
1 comment · 13 min read · LW link

How Unmonitored External Agents can Sabotage AI labs

9 Apr 2026 18:07 UTC
18 points
0 comments · 9 min read · LW link

Measuring and improving coding audit realism with deployment resources

23 Mar 2026 17:20 UTC
43 points
1 comment · 10 min read · LW link
(alignment.anthropic.com)