Fabien Roger

Karma: 2,528

Fermi estimation of the impact you might have working on AI safety

Fabien Roger

13 May 2022 17:49 UTC

6 points

0 comments

1 min read

LW link

The impact you might have working on AI safety

Fabien Roger

29 May 2022 16:31 UTC

5 points

1 comment

4 min read

LW link

How To Know What the AI Knows—An ELK Distillation

Fabien Roger

4 Sep 2022 0:46 UTC

7 points

0 comments

5 min read

LW link

A Mystery About High Dimensional Concept Encoding

Fabien Roger

3 Nov 2022 17:05 UTC

46 points

13 comments

7 min read

LW link

By Default, GPTs Think In Plain Sight

Fabien Roger

19 Nov 2022 19:15 UTC

85 points

33 comments

9 min read

LW link

Extracting and Evaluating Causal Direction in LLMs’ Activations

Fabien Roger and simeon_c

14 Dec 2022 14:33 UTC

29 points

5 comments

11 min read

LW link

The Translucent Thoughts Hypotheses and Their Implications

Fabien Roger

9 Mar 2023 16:30 UTC

133 points

7 comments

19 min read

LW link

Some ML-Related Math I Now Understand Better

Fabien Roger

9 Mar 2023 16:35 UTC

45 points

4 comments

4 min read

LW link

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger

13 Mar 2023 19:29 UTC

164 points

16 comments

11 min read

LW link

How Do Induction Heads Actually Work in Transformers With Finite Capacity?

Fabien Roger

23 Mar 2023 9:09 UTC

27 points

0 comments

5 min read

LW link

LLMs Sometimes Generate Purely Negatively-Reinforced Text

Fabien Roger

16 Jun 2023 16:31 UTC

176 points

11 comments

7 min read

LW link

Simplified bio-anchors for upper bounds on AI timelines

Fabien Roger

15 Jul 2023 18:15 UTC

20 points

4 comments

5 min read

LW link

Password-locked models: a stress case for capabilities evaluation

Fabien Roger

3 Aug 2023 14:53 UTC

144 points

14 comments

6 min read

LW link

When AI critique works even with misaligned models

Fabien Roger

17 Aug 2023 0:12 UTC

23 points

0 comments

2 min read

LW link

If influence functions are not approximating leave-one-out, how are they supposed to help?

Fabien Roger

22 Sep 2023 14:23 UTC

66 points

4 comments

3 min read

LW link

Will early transformative AIs primarily use text? [Manifold question]

Fabien Roger

2 Oct 2023 15:05 UTC

16 points

0 comments

3 min read

LW link

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

Fabien Roger and Buck

23 Oct 2023 16:37 UTC

101 points

3 comments

8 min read

LW link

Preventing Language Models from hiding their reasoning

Fabien Roger and ryan_greenblatt

31 Oct 2023 14:34 UTC

107 points

12 comments

12 min read

LW link

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger

17 Nov 2023 17:58 UTC

85 points

7 comments

14 min read

LW link

Some negative steganography results

Fabien Roger

9 Dec 2023 20:22 UTC

55 points

5 comments

2 min read

LW link