Fabien Roger

Karma: 2,528

Fermi estimation of the impact you might have working on AI safety

Fabien Roger · 13 May 2022 17:49 UTC
6 points
0 comments · 1 min read · LW link

The impact you might have working on AI safety

Fabien Roger · 29 May 2022 16:31 UTC
5 points
1 comment · 4 min read · LW link

How To Know What the AI Knows—An ELK Distillation

Fabien Roger · 4 Sep 2022 0:46 UTC
7 points
0 comments · 5 min read · LW link

A Mystery About High Dimensional Concept Encoding

Fabien Roger · 3 Nov 2022 17:05 UTC
46 points
13 comments · 7 min read · LW link

By Default, GPTs Think In Plain Sight

Fabien Roger · 19 Nov 2022 19:15 UTC
85 points
33 comments · 9 min read · LW link

Extracting and Evaluating Causal Direction in LLMs’ Activations

14 Dec 2022 14:33 UTC
29 points
5 comments · 11 min read · LW link

The Translucent Thoughts Hypotheses and Their Implications

Fabien Roger · 9 Mar 2023 16:30 UTC
133 points
7 comments · 19 min read · LW link

Some ML-Related Math I Now Understand Better

Fabien Roger · 9 Mar 2023 16:35 UTC
45 points
4 comments · 4 min read · LW link

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger · 13 Mar 2023 19:29 UTC
164 points
16 comments · 11 min read · LW link

How Do Induction Heads Actually Work in Transformers With Finite Capacity?

Fabien Roger · 23 Mar 2023 9:09 UTC
27 points
0 comments · 5 min read · LW link

LLMs Sometimes Generate Purely Negatively-Reinforced Text

Fabien Roger · 16 Jun 2023 16:31 UTC
176 points
11 comments · 7 min read · LW link

Simplified bio-anchors for upper bounds on AI timelines

Fabien Roger · 15 Jul 2023 18:15 UTC
20 points
4 comments · 5 min read · LW link

Password-locked models: a stress case for capabilities evaluation

Fabien Roger · 3 Aug 2023 14:53 UTC
144 points
14 comments · 6 min read · LW link

When AI critique works even with misaligned models

Fabien Roger · 17 Aug 2023 0:12 UTC
23 points
0 comments · 2 min read · LW link

If influence functions are not approximating leave-one-out, how are they supposed to help?

Fabien Roger · 22 Sep 2023 14:23 UTC
66 points
4 comments · 3 min read · LW link

Will early transformative AIs primarily use text? [Manifold question]

Fabien Roger · 2 Oct 2023 15:05 UTC
16 points
0 comments · 3 min read · LW link

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

23 Oct 2023 16:37 UTC
101 points
3 comments · 8 min read · LW link

Preventing Language Models from hiding their reasoning

31 Oct 2023 14:34 UTC
107 points
12 comments · 12 min read · LW link

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger · 17 Nov 2023 17:58 UTC
85 points
7 comments · 14 min read · LW link

Some negative steganography results

Fabien Roger · 9 Dec 2023 20:22 UTC
55 points
5 comments · 2 min read · LW link