Fabien Roger

Karma: 2,519

Open consultancy: Letting untrusted AIs choose what answer to argue for

Fabien Roger · 12 Mar 2024 20:38 UTC
35 points
4 comments · 5 min read · LW link

Fabien's Shortform

Fabien Roger · 5 Mar 2024 18:58 UTC
6 points
26 comments · 1 min read · LW link

Notes on control evaluations for safety cases

28 Feb 2024 16:15 UTC
32 points
0 comments · 32 min read · LW link

Protocol evaluations: good analogies vs control

Fabien Roger · 19 Feb 2024 18:00 UTC
35 points
10 comments · 11 min read · LW link

Toy models of AI control for concentrated catastrophe prevention

6 Feb 2024 1:38 UTC
50 points
2 comments · 7 min read · LW link

A quick investigation of AI pro-AI bias

Fabien Roger · 19 Jan 2024 23:26 UTC
52 points
1 comment · 2 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
56 points
10 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

16 Dec 2023 5:49 UTC
72 points
3 comments · 6 min read · LW link

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
196 points
4 comments · 10 min read · LW link

Auditing failures vs concentrated failures

11 Dec 2023 2:47 UTC
44 points
0 comments · 7 min read · LW link

Some negative steganography results

Fabien Roger · 9 Dec 2023 20:22 UTC
55 points
5 comments · 2 min read · LW link

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger · 17 Nov 2023 17:58 UTC
85 points
7 comments · 14 min read · LW link

Preventing Language Models from hiding their reasoning

31 Oct 2023 14:34 UTC
107 points
12 comments · 12 min read · LW link

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

23 Oct 2023 16:37 UTC
101 points
3 comments · 8 min read · LW link

Will early transformative AIs primarily use text? [Manifold question]

Fabien Roger · 2 Oct 2023 15:05 UTC
16 points
0 comments · 3 min read · LW link

If influence functions are not approximating leave-one-out, how are they supposed to help?

Fabien Roger · 22 Sep 2023 14:23 UTC
66 points
4 comments · 3 min read · LW link

Benchmarks for Detecting Measurement Tampering [Redwood Research]

5 Sep 2023 16:44 UTC
84 points
16 comments · 20 min read · LW link
(arxiv.org)

When AI critique works even with misaligned models

Fabien Roger · 17 Aug 2023 0:12 UTC
23 points
0 comments · 2 min read · LW link

Password-locked models: a stress case for capabilities evaluation

Fabien Roger · 3 Aug 2023 14:53 UTC
142 points
14 comments · 6 min read · LW link

Simplified bio-anchors for upper bounds on AI timelines

Fabien Roger · 15 Jul 2023 18:15 UTC
20 points
4 comments · 5 min read · LW link