Fabien Roger

Karma: 6,238

I am working on empirical AI safety.

Book a call with me if you want advice on a concrete empirical safety project.

Anonymous feedback form.

Training Qwen-1.5B with a CoT legibility penalty

Fabien Roger

9 Oct 2025 21:33 UTC

64 points

5 comments · 4 min read · LW link

Training fails to elicit subtle reasoning in current language models

mishajw, Fabien Roger, Hoagy, gasteigerjo, Joe Benton and Vlad Mikulik

9 Oct 2025 19:04 UTC

48 points

2 comments · 4 min read · LW link

(alignment.anthropic.com)

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen and Fabien Roger

8 Oct 2025 22:02 UTC

141 points

26 comments · 2 min read · LW link

Four places where you can put LLM monitoring

Fabien Roger and Buck

9 Aug 2025 23:10 UTC

48 points

0 comments · 7 min read · LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

abhayesian, John Hughes, Alex Mallen, Jozdien, janus and Fabien Roger

8 Jul 2025 21:49 UTC

158 points

14 comments · 5 min read · LW link

(arxiv.org)

What can be learned from scary demos? A snitching case study

Fabien Roger

24 Jun 2025 8:40 UTC

22 points

1 comment · 7 min read · LW link

Modifying LLM Beliefs with Synthetic Document Finetuning

RowanWang, Johannes Treutlein, Avery, Ethan Perez, Fabien Roger and Sam Marks

24 Apr 2025 21:15 UTC

70 points

12 comments · 2 min read · LW link

(alignment.anthropic.com)

Reasoning models don’t always say what they think

Joe Benton, Ethan Perez, Vlad Mikulik and Fabien Roger

9 Apr 2025 19:48 UTC

28 points

4 comments · 1 min read · LW link

(www.anthropic.com)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

John Hughes, abhayesian, Akbir Khan and Fabien Roger

8 Apr 2025 17:32 UTC

146 points

20 comments · 12 min read · LW link

Automated Researchers Can Subtly Sandbag

gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez and Fabien Roger

26 Mar 2025 19:13 UTC

44 points

0 comments · 4 min read · LW link

(alignment.anthropic.com)

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

13 Mar 2025 19:18 UTC

141 points

15 comments · 13 min read · LW link

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

Fabien Roger

11 Mar 2025 11:52 UTC

127 points

23 comments · 11 min read · LW link

(alignment.anthropic.com)

Fuzzing LLMs sometimes makes them reveal their secrets

Fabien Roger

26 Feb 2025 16:48 UTC

64 points

13 comments · 9 min read · LW link

How to replicate and extend our alignment faking demo

Fabien Roger

19 Dec 2024 21:44 UTC

114 points

5 comments · 2 min read · LW link

(alignment.anthropic.com)

Alignment Faking in Large Language Models

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck

18 Dec 2024 17:19 UTC

489 points

75 comments · 10 min read · LW link

A toy evaluation of inference code tampering

Fabien Roger

9 Dec 2024 17:43 UTC

52 points

0 comments · 9 min read · LW link

(alignment.anthropic.com)

The case for unlearning that removes information from LLM weights

Fabien Roger

14 Oct 2024 14:08 UTC

102 points

18 comments · 6 min read · LW link

[Question] Is cybercrime really costing trillions per year?

Fabien Roger

27 Sep 2024 8:44 UTC

63 points

28 comments · 1 min read · LW link

An issue with training schemers with supervised fine-tuning

Fabien Roger

27 Jun 2024 15:37 UTC

49 points

12 comments · 6 min read · LW link

Best-of-n with misaligned reward models for Math reasoning

Fabien Roger

21 Jun 2024 22:53 UTC

25 points

0 comments · 3 min read · LW link