Fabien Roger
Karma: 2,528
Fermi estimation of the impact you might have working on AI safety
Fabien Roger, 13 May 2022 17:49 UTC, 6 points, 0 comments, 1 min read, LW link
The impact you might have working on AI safety
Fabien Roger, 29 May 2022 16:31 UTC, 5 points, 1 comment, 4 min read, LW link
How To Know What the AI Knows—An ELK Distillation
Fabien Roger, 4 Sep 2022 0:46 UTC, 7 points, 0 comments, 5 min read, LW link
A Mystery About High Dimensional Concept Encoding
Fabien Roger, 3 Nov 2022 17:05 UTC, 46 points, 13 comments, 7 min read, LW link
By Default, GPTs Think In Plain Sight
Fabien Roger, 19 Nov 2022 19:15 UTC, 85 points, 33 comments, 9 min read, LW link
Extracting and Evaluating Causal Direction in LLMs’ Activations
Fabien Roger and simeon_c, 14 Dec 2022 14:33 UTC, 29 points, 5 comments, 11 min read, LW link
The Translucent Thoughts Hypotheses and Their Implications
Fabien Roger, 9 Mar 2023 16:30 UTC, 133 points, 7 comments, 19 min read, LW link
Some ML-Related Math I Now Understand Better
Fabien Roger, 9 Mar 2023 16:35 UTC, 45 points, 4 comments, 4 min read, LW link
What Discovering Latent Knowledge Did and Did Not Find
Fabien Roger, 13 Mar 2023 19:29 UTC, 164 points, 16 comments, 11 min read, LW link
How Do Induction Heads Actually Work in Transformers With Finite Capacity?
Fabien Roger, 23 Mar 2023 9:09 UTC, 27 points, 0 comments, 5 min read, LW link
LLMs Sometimes Generate Purely Negatively-Reinforced Text
Fabien Roger, 16 Jun 2023 16:31 UTC, 176 points, 11 comments, 7 min read, LW link
Simplified bio-anchors for upper bounds on AI timelines
Fabien Roger, 15 Jul 2023 18:15 UTC, 20 points, 4 comments, 5 min read, LW link
Password-locked models: a stress case for capabilities evaluation
Fabien Roger, 3 Aug 2023 14:53 UTC, 144 points, 14 comments, 6 min read, LW link
When AI critique works even with misaligned models
Fabien Roger, 17 Aug 2023 0:12 UTC, 23 points, 0 comments, 2 min read, LW link
If influence functions are not approximating leave-one-out, how are they supposed to help?
Fabien Roger, 22 Sep 2023 14:23 UTC, 66 points, 4 comments, 3 min read, LW link
Will early transformative AIs primarily use text? [Manifold question]
Fabien Roger, 2 Oct 2023 15:05 UTC, 16 points, 0 comments, 3 min read, LW link
Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation
Fabien Roger and Buck, 23 Oct 2023 16:37 UTC, 101 points, 3 comments, 8 min read, LW link
Preventing Language Models from hiding their reasoning
Fabien Roger and ryan_greenblatt, 31 Oct 2023 14:34 UTC, 107 points, 12 comments, 12 min read, LW link
Coup probes: Catching catastrophes with probes trained off-policy
Fabien Roger, 17 Nov 2023 17:58 UTC, 85 points, 7 comments, 14 min read, LW link
Some negative steganography results
Fabien Roger, 9 Dec 2023 20:22 UTC, 55 points, 5 comments, 2 min read, LW link