
Redwood Research

Last edit: 30 Dec 2024 10:12 UTC by Dakara

Redwood Research is a nonprofit organization focused on mitigating risks from advanced artificial intelligence.

Their research agenda spans AI control, mechanistic interpretability, and adversarial training for high-stakes reliability; posts related to Redwood Research are listed below.

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
483 points
75 comments · 10 min read · LW link

The case for ensuring that powerful AIs are controlled

24 Jan 2024 16:11 UTC
276 points
73 comments · 28 min read · LW link

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

3 Dec 2022 0:58 UTC
206 points
35 comments · 20 min read · LW link · 1 review

Takeaways from our robust injury classifier project [Redwood Research]

dmz · 17 Sep 2022 3:55 UTC
143 points
12 comments · 6 min read · LW link · 1 review

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
236 points
24 comments · 10 min read · LW link · 4 reviews

Benchmarks for Detecting Measurement Tampering [Redwood Research]

5 Sep 2023 16:44 UTC
87 points
22 comments · 20 min read · LW link · 1 review
(arxiv.org)

AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

DanielFilan · 21 Aug 2022 23:50 UTC
16 points
0 comments · 35 min read · LW link

Preventing Language Models from hiding their reasoning

31 Oct 2023 14:34 UTC
119 points
15 comments · 12 min read · LW link · 1 review

Redwood Research’s current project

Buck · 21 Sep 2021 23:30 UTC
145 points
29 comments · 15 min read · LW link · 1 review

Redwood’s Technique-Focused Epistemic Strategy

adamShimi · 12 Dec 2021 16:36 UTC
48 points
1 comment · 7 min read · LW link

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

27 Oct 2022 1:32 UTC
135 points
14 comments · 12 min read · LW link

Catching AIs red-handed

5 Jan 2024 17:43 UTC
110 points
27 comments · 17 min read · LW link

Why I’m excited about Redwood Research’s current project

paulfchristiano · 12 Nov 2021 19:26 UTC
114 points
6 comments · 7 min read · LW link

How will we update about scheming?

ryan_greenblatt · 6 Jan 2025 20:21 UTC
169 points
20 comments · 36 min read · LW link

High-stakes alignment via adversarial training [Redwood Research report]

5 May 2022 0:59 UTC
142 points
29 comments · 9 min read · LW link

Will alignment-faking Claude accept a deal to reveal its misalignment?

31 Jan 2025 16:49 UTC
197 points
28 comments · 12 min read · LW link

Some common confusion about induction heads

Alexandre Variengien · 28 Mar 2023 21:51 UTC
64 points
4 comments · 5 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

16 Dec 2023 5:49 UTC
76 points
4 comments · 6 min read · LW link · 1 review

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
57 points
10 comments · 4 min read · LW link

Coup probes: Catching catastrophes with probes trained off-policy

Fabien Roger · 17 Nov 2023 17:58 UTC
91 points
9 comments · 11 min read · LW link · 1 review

Balancing Label Quantity and Quality for Scalable Elicitation

Alex Mallen · 24 Oct 2024 16:49 UTC
31 points
1 comment · 2 min read · LW link

A quick experiment on LMs’ inductive biases in performing search

Alex Mallen · 14 Apr 2024 3:41 UTC
32 points
2 comments · 4 min read · LW link

LLMs are (mostly) not helped by filler tokens

Kshitij Sachan · 10 Aug 2023 0:48 UTC
66 points
35 comments · 6 min read · LW link

Polysemanticity and Capacity in Neural Networks

7 Oct 2022 17:51 UTC
87 points
14 comments · 3 min read · LW link

A sketch of an AI control safety case

30 Jan 2025 17:28 UTC
57 points
0 comments · 5 min read · LW link

We’re Redwood Research, we do applied alignment research, AMA

Nate Thomas · 6 Oct 2021 5:51 UTC
56 points
2 comments · 2 min read · LW link
(forum.effectivealtruism.org)

Redwood Research is hiring for several roles

29 Nov 2021 0:16 UTC
44 points
0 comments · 1 min read · LW link

Redwood Research is hiring for several roles (Operations and Technical)

14 Apr 2022 16:57 UTC
29 points
0 comments · 1 min read · LW link

Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small

12 Oct 2022 21:25 UTC
50 points
11 comments · 4 min read · LW link

Measuring whether AIs can statelessly strategize to subvert security measures

19 Dec 2024 21:25 UTC
62 points
0 comments · 11 min read · LW link

Causal scrubbing: results on a paren balance checker

3 Dec 2022 0:59 UTC
34 points
2 comments · 30 min read · LW link

Causal scrubbing: results on induction heads

3 Dec 2022 0:59 UTC
34 points
1 comment · 17 min read · LW link

Causal scrubbing: Appendix

3 Dec 2022 0:58 UTC
18 points
4 comments · 20 min read · LW link

Practical Pitfalls of Causal Scrubbing

27 Mar 2023 7:47 UTC
87 points
17 comments · 13 min read · LW link

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

28 Oct 2022 23:55 UTC
101 points
9 comments · 9 min read · LW link · 2 reviews
(arxiv.org)

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

26 Jul 2023 17:02 UTC
99 points
19 comments · 1 min read · LW link · 1 review

Managing catastrophic misuse without robust AIs

16 Jan 2024 17:27 UTC
63 points
17 comments · 11 min read · LW link

How to prevent collusion when using untrusted models to monitor each other

Buck · 25 Sep 2024 18:58 UTC
89 points
11 comments · 22 min read · LW link

Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22]

3 Nov 2021 18:22 UTC
95 points
4 comments · 1 min read · LW link

Some ideas for follow-up projects to Redwood Research’s recent paper

JanB · 6 Jun 2022 13:29 UTC
10 points
0 comments · 7 min read · LW link

Apply to the second iteration of the ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2]

Buck · 6 May 2022 4:23 UTC
69 points
0 comments · 6 min read · LW link

Why imperfect adversarial robustness doesn’t doom AI control

18 Nov 2024 16:05 UTC
62 points
25 comments · 2 min read · LW link

Win/continue/lose scenarios and execute/replace/audit protocols

Buck · 15 Nov 2024 15:47 UTC
64 points
2 comments · 7 min read · LW link

A basic systems architecture for AI agents that do autonomous research

Buck · 23 Sep 2024 13:58 UTC
189 points
16 comments · 8 min read · LW link

Access to powerful AI might make computer security radically easier

Buck · 8 Jun 2024 6:00 UTC
101 points
14 comments · 6 min read · LW link

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

23 Oct 2023 16:37 UTC
107 points
3 comments · 8 min read · LW link

Untrusted smart models and trusted dumb models

Buck · 4 Nov 2023 3:06 UTC
87 points
17 comments · 6 min read · LW link · 1 review

Preventing model exfiltration with upload limits

ryan_greenblatt · 6 Feb 2024 16:29 UTC
69 points
22 comments · 14 min read · LW link

Improving the Welfare of AIs: A Nearcasted Proposal

ryan_greenblatt · 30 Oct 2023 14:51 UTC
112 points
8 comments · 20 min read · LW link · 1 review

[Paper] Stress-testing capability elicitation with password-locked models

4 Jun 2024 14:52 UTC
85 points
10 comments · 12 min read · LW link
(arxiv.org)

Notes on control evaluations for safety cases

28 Feb 2024 16:15 UTC
49 points
0 comments · 32 min read · LW link

Toy models of AI control for concentrated catastrophe prevention

6 Feb 2024 1:38 UTC
51 points
2 comments · 7 min read · LW link