Fabien Roger

Karma: 2,528

LLMs Sometimes Generate Purely Negatively-Reinforced Text

Fabien Roger, 16 Jun 2023 16:31 UTC
176 points
11 comments, 7 min read

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger, 13 Mar 2023 19:29 UTC
164 points
16 comments, 11 min read

Password-locked models: a stress case for capabilities evaluation

Fabien Roger, 3 Aug 2023 14:53 UTC
144 points
14 comments, 6 min read

The Translucent Thoughts Hypotheses and Their Implications

Fabien Roger, 9 Mar 2023 16:30 UTC
133 points
7 comments, 19 min read

Preventing Language Models from hiding their reasoning

31 Oct 2023 14:34 UTC
107 points
12 comments, 12 min read

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

23 Oct 2023 16:37 UTC
101 points
3 comments, 8 min read

By Default, GPTs Think In Plain Sight

Fabien Roger, 19 Nov 2022 19:15 UTC
85 points
33 comments, 9 min read

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger, 17 Nov 2023 17:58 UTC
85 points
7 comments, 14 min read

If influence functions are not approximating leave-one-out, how are they supposed to help?

Fabien Roger, 22 Sep 2023 14:23 UTC
66 points
4 comments, 3 min read

Some negative steganography results

Fabien Roger, 9 Dec 2023 20:22 UTC
55 points
5 comments, 2 min read

A quick investigation of AI pro-AI bias

Fabien Roger, 19 Jan 2024 23:26 UTC
52 points
1 comment, 2 min read

Toy models of AI control for concentrated catastrophe prevention

6 Feb 2024 1:38 UTC
50 points
2 comments, 7 min read

A Mystery About High Dimensional Concept Encoding

Fabien Roger, 3 Nov 2022 17:05 UTC
46 points
13 comments, 7 min read

Some ML-Related Math I Now Understand Better

Fabien Roger, 9 Mar 2023 16:35 UTC
45 points
4 comments, 4 min read

Open consultancy: Letting untrusted AIs choose what answer to argue for

Fabien Roger, 12 Mar 2024 20:38 UTC
35 points
4 comments, 5 min read

Protocol evaluations: good analogies vs control

Fabien Roger, 19 Feb 2024 18:00 UTC
35 points
10 comments, 11 min read