
ryan_greenblatt

Karma: 6,038

I work at Redwood Research.

The case for ensuring that powerful AIs are controlled

24 Jan 2024 16:11 UTC
243 points
66 comments · 28 min read · LW link

How useful is mechanistic interpretability?

1 Dec 2023 2:54 UTC
155 points
53 comments · 25 min read · LW link

Improving the Welfare of AIs: A Nearcasted Proposal

30 Oct 2023 14:51 UTC
87 points
5 comments · 20 min read · LW link

Benchmarks for Detecting Measurement Tampering [Redwood Research]

5 Sep 2023 16:44 UTC
84 points
14 comments · 20 min read · LW link
(arxiv.org)

Catching AIs red-handed

5 Jan 2024 17:43 UTC
82 points
18 comments · 17 min read · LW link

Two problems with ‘Simulators’ as a frame

17 Feb 2023 23:34 UTC
81 points
13 comments · 5 min read · LW link

Preventing model exfiltration with upload limits

6 Feb 2024 16:29 UTC
63 points
15 comments · 14 min read · LW link

Managing catastrophic misuse without robust AIs

16 Jan 2024 17:27 UTC
58 points
16 comments · 11 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
56 points
10 comments · 4 min read · LW link