ryan_greenblatt

Karma: 6,265

I work at Redwood Research.

How useful is “AI Control” as a framing on AI X-Risk?

14 Mar 2024 18:06 UTC
67 points
4 comments · 34 min read · LW link

Notes on control evaluations for safety cases

28 Feb 2024 16:15 UTC
32 points
0 comments · 32 min read · LW link

Preventing model exfiltration with upload limits

ryan_greenblatt · 6 Feb 2024 16:29 UTC
63 points
16 comments · 14 min read · LW link

The case for ensuring that powerful AIs are controlled

24 Jan 2024 16:11 UTC
245 points
66 comments · 28 min read · LW link

Managing catastrophic misuse without robust AIs

16 Jan 2024 17:27 UTC
58 points
16 comments · 11 min read · LW link

Catching AIs red-handed

5 Jan 2024 17:43 UTC
82 points
18 comments · 17 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
56 points
10 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

16 Dec 2023 5:49 UTC
73 points
3 comments · 6 min read · LW link

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
197 points
7 comments · 10 min read · LW link

Auditing failures vs concentrated failures

11 Dec 2023 2:47 UTC
44 points
0 comments · 7 min read · LW link

How useful is mechanistic interpretability?

1 Dec 2023 2:54 UTC
156 points
53 comments · 25 min read · LW link

Preventing Language Models from hiding their reasoning

31 Oct 2023 14:34 UTC
107 points
12 comments · 12 min read · LW link

ryan_greenblatt’s Shortform

ryan_greenblatt · 30 Oct 2023 16:51 UTC
6 points
33 comments · 1 min read · LW link

Improving the Welfare of AIs: A Nearcasted Proposal

ryan_greenblatt · 30 Oct 2023 14:51 UTC
87 points
5 comments · 20 min read · LW link

What’s up with “Responsible Scaling Policies”?

29 Oct 2023 4:17 UTC
99 points
8 comments · 20 min read · LW link

Benchmarks for Detecting Measurement Tampering [Redwood Research]

5 Sep 2023 16:44 UTC
84 points
18 comments · 20 min read · LW link
(arxiv.org)

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

26 Jul 2023 17:02 UTC
83 points
18 comments · 1 min read · LW link

Two problems with ‘Simulators’ as a frame

ryan_greenblatt · 17 Feb 2023 23:34 UTC
81 points
13 comments · 5 min read · LW link

Causal scrubbing: results on induction heads

3 Dec 2022 0:59 UTC
34 points
1 comment · 17 min read · LW link

Causal scrubbing: results on a paren balance checker

3 Dec 2022 0:59 UTC
34 points
2 comments · 30 min read · LW link