Fabien Roger

Karma: 6,941

I am working on empirical AI safety.

Book a call with me if you want advice on a concrete empirical safety project.

Anonymous feedback form.

Towards training-time mitigations for alignment faking in RL

16 Dec 2025 21:01 UTC
32 points
1 comment · 5 min read · LW link
(alignment.anthropic.com)

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

25 Nov 2025 19:33 UTC
40 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)