Fabien Roger (Karma: 8,041)
I am working on empirical AI safety. Anonymous feedback form.
Poisoning Fine-tuning Datasets of Constitutional Classifiers
Chase Bowers and Fabien Roger · 29 Apr 2026 17:04 UTC · 14 points · 1 comment · 11 min read · (alignment.anthropic.com)
Control protocols don’t always need to know which models are scheming
Fabien Roger · 26 Apr 2026 19:16 UTC · 38 points · 1 comment · 6 min read
Narrow Secret Loyalty Dodges Black-Box Audits
Alfie Lamerton and Fabien Roger · 22 Apr 2026 9:41 UTC · 48 points · 1 comment · 13 min read
How Unmonitored External Agents can Sabotage AI labs
Elle Najt and Fabien Roger · 9 Apr 2026 18:07 UTC · 18 points · 0 comments · 9 min read
Measuring and improving coding audit realism with deployment resources
Connor Kissane, Monte M and Fabien Roger · 23 Mar 2026 17:20 UTC · 43 points · 1 comment · 10 min read · (alignment.anthropic.com)
Self-Attribution Bias: When AI Monitors Go Easy on Themselves
Dipika Khullar, Jack Hopkins and Fabien Roger · 6 Mar 2026 21:54 UTC · 43 points · 5 comments · 6 min read
Tools to generate realistic prompts help surprisingly little with Petri audit realism
Connor Kissane, Monte M and Fabien Roger · 1 Mar 2026 8:18 UTC · 44 points · 2 comments · 7 min read
3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation
Callum Canavan, Aditya Shrivastava, Allison Qi, Jonathan Michala and Fabien Roger · 27 Feb 2026 17:25 UTC · 21 points · 0 comments · 10 min read
Refusals that could become catastrophic
Fabien Roger · 30 Jan 2026 4:12 UTC · 84 points · 12 comments · 7 min read
Eliciting base models with simple unsupervised techniques
Callum Canavan, Aditya Shrivastava, Allison Qi, Tianyi (Alex) Qiu, Jonathan Michala and Fabien Roger · 23 Jan 2026 18:06 UTC · 34 points · 2 comments · 8 min read
Should control down-weight negative net-sabotage-value threats?
Fabien Roger · 16 Jan 2026 4:18 UTC · 35 points · 0 comments · 10 min read
Towards training-time mitigations for alignment faking in RL
Vlad Mikulik, gasteigerjo, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger and evhub · 16 Dec 2025 21:01 UTC · 39 points · 1 comment · 5 min read · (alignment.anthropic.com)
Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
RowanWang, Sam Marks, Johannes Treutlein, evhub and Fabien Roger · 25 Nov 2025 19:33 UTC · 41 points · 0 comments · 4 min read · (alignment.anthropic.com)
Thinking about reasoning models made me less worried about scheming
Fabien Roger · 20 Nov 2025 18:20 UTC · 88 points · 7 comments · 12 min read
Steering Language Models with Weight Arithmetic
Fabien Roger and constanzafierro · 11 Nov 2025 16:30 UTC · 87 points · 5 comments · 5 min read
Rogue internal deployments via external APIs
Fabien Roger and Buck · 15 Oct 2025 19:34 UTC · 34 points · 4 comments · 6 min read
Current Language Models Struggle to Reason in Ciphered Language
Fabien Roger and Shiyuan Guo · 14 Oct 2025 9:08 UTC · 78 points · 7 comments · 5 min read
Training Qwen-1.5B with a CoT legibility penalty
Fabien Roger · 9 Oct 2025 21:33 UTC · 68 points · 7 comments · 4 min read
Training fails to elicit subtle reasoning in current language models
mishajw, Fabien Roger, Hoagy, gasteigerjo, Joe Benton and Vlad Mikulik · 9 Oct 2025 19:04 UTC · 49 points · 3 comments · 4 min read · (alignment.anthropic.com)
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen and Fabien Roger · 8 Oct 2025 22:02 UTC · 176 points · 37 comments · 2 min read