Fabien Roger

Karma: 8,041

I am working on empirical AI safety.

Anonymous feedback form.

Poisoning Fine-tuning Datasets of Constitutional Classifiers

29 Apr 2026 17:04 UTC
14 points
1 comment · 11 min read · LW link
(alignment.anthropic.com)

Control protocols don’t always need to know which models are scheming

Fabien Roger · 26 Apr 2026 19:16 UTC
38 points
1 comment · 6 min read · LW link

Narrow Secret Loyalty Dodges Black-Box Audits

22 Apr 2026 9:41 UTC
48 points
1 comment · 13 min read · LW link

How Unmonitored External Agents can Sabotage AI labs

9 Apr 2026 18:07 UTC
18 points
0 comments · 9 min read · LW link

Measuring and improving coding audit realism with deployment resources

23 Mar 2026 17:20 UTC
43 points
1 comment · 10 min read · LW link
(alignment.anthropic.com)

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

6 Mar 2026 21:54 UTC
43 points
5 comments · 6 min read · LW link

Tools to generate realistic prompts help surprisingly little with Petri audit realism

1 Mar 2026 8:18 UTC
44 points
2 comments · 7 min read · LW link

3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation

27 Feb 2026 17:25 UTC
21 points
0 comments · 10 min read · LW link

Refusals that could become catastrophic

Fabien Roger · 30 Jan 2026 4:12 UTC
84 points
12 comments · 7 min read · LW link

Eliciting base models with simple unsupervised techniques

23 Jan 2026 18:06 UTC
34 points
2 comments · 8 min read · LW link

Should control down-weight negative net-sabotage-value threats?

Fabien Roger · 16 Jan 2026 4:18 UTC
35 points
0 comments · 10 min read · LW link

Towards training-time mitigations for alignment faking in RL

16 Dec 2025 21:01 UTC
39 points
1 comment · 5 min read · LW link
(alignment.anthropic.com)

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

25 Nov 2025 19:33 UTC
41 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Thinking about reasoning models made me less worried about scheming

Fabien Roger · 20 Nov 2025 18:20 UTC
88 points
7 comments · 12 min read · LW link

Steering Language Models with Weight Arithmetic

11 Nov 2025 16:30 UTC
87 points
5 comments · 5 min read · LW link

Rogue internal deployments via external APIs

15 Oct 2025 19:34 UTC
34 points
4 comments · 6 min read · LW link

Current Language Models Struggle to Reason in Ciphered Language

14 Oct 2025 9:08 UTC
78 points
7 comments · 5 min read · LW link

Training Qwen-1.5B with a CoT legibility penalty

Fabien Roger · 9 Oct 2025 21:33 UTC
68 points
7 comments · 4 min read · LW link

Training fails to elicit subtle reasoning in current language models

9 Oct 2025 19:04 UTC
49 points
3 comments · 4 min read · LW link
(alignment.anthropic.com)

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
176 points
37 comments · 2 min read · LW link