ryan_greenblatt (Karma: 6,265)
I work at Redwood Research.
How useful is “AI Control” as a framing on AI X-Risk?
habryka and ryan_greenblatt · 14 Mar 2024 18:06 UTC · 67 points · 4 comments · 34 min read · LW link
Notes on control evaluations for safety cases
ryan_greenblatt, Buck and Fabien Roger · 28 Feb 2024 16:15 UTC · 32 points · 0 comments · 32 min read · LW link
Preventing model exfiltration with upload limits
ryan_greenblatt · 6 Feb 2024 16:29 UTC · 63 points · 16 comments · 14 min read · LW link
The case for ensuring that powerful AIs are controlled
ryan_greenblatt and Buck · 24 Jan 2024 16:11 UTC · 245 points · 66 comments · 28 min read · LW link
Managing catastrophic misuse without robust AIs
ryan_greenblatt and Buck · 16 Jan 2024 17:27 UTC · 58 points · 16 comments · 11 min read · LW link
Catching AIs red-handed
ryan_greenblatt and Buck · 5 Jan 2024 17:43 UTC · 82 points · 18 comments · 17 min read · LW link
Measurement tampering detection as a special case of weak-to-strong generalization
ryan_greenblatt, Fabien Roger and Buck · 23 Dec 2023 0:05 UTC · 56 points · 10 comments · 4 min read · LW link
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem
Ansh Radhakrishnan, Buck, ryan_greenblatt and Fabien Roger · 16 Dec 2023 5:49 UTC · 73 points · 3 comments · 6 min read · LW link
AI Control: Improving Safety Despite Intentional Subversion
Buck, Fabien Roger, ryan_greenblatt and Kshitij Sachan · 13 Dec 2023 15:51 UTC · 197 points · 7 comments · 10 min read · LW link
Auditing failures vs concentrated failures
ryan_greenblatt and Fabien Roger · 11 Dec 2023 2:47 UTC · 44 points · 0 comments · 7 min read · LW link
How useful is mechanistic interpretability?
ryan_greenblatt, Neel Nanda, Buck and habryka · 1 Dec 2023 2:54 UTC · 156 points · 53 comments · 25 min read · LW link
Preventing Language Models from hiding their reasoning
Fabien Roger and ryan_greenblatt · 31 Oct 2023 14:34 UTC · 107 points · 12 comments · 12 min read · LW link
ryan_greenblatt’s Shortform
ryan_greenblatt · 30 Oct 2023 16:51 UTC · 6 points · 33 comments · 1 min read · LW link
Improving the Welfare of AIs: A Nearcasted Proposal
ryan_greenblatt · 30 Oct 2023 14:51 UTC · 87 points · 5 comments · 20 min read · LW link
What’s up with “Responsible Scaling Policies”?
habryka and ryan_greenblatt · 29 Oct 2023 4:17 UTC · 99 points · 8 comments · 20 min read · LW link
Benchmarks for Detecting Measurement Tampering [Redwood Research]
ryan_greenblatt and Fabien Roger · 5 Sep 2023 16:44 UTC · 84 points · 18 comments · 20 min read · LW link (arxiv.org)
Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy
Buck and ryan_greenblatt · 26 Jul 2023 17:02 UTC · 83 points · 18 comments · 1 min read · LW link
Two problems with ‘Simulators’ as a frame
ryan_greenblatt · 17 Feb 2023 23:34 UTC · 81 points · 13 comments · 5 min read · LW link
Causal scrubbing: results on induction heads
LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas · 3 Dec 2022 0:59 UTC · 34 points · 1 comment · 17 min read · LW link
Causal scrubbing: results on a paren balance checker
LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas · 3 Dec 2022 0:59 UTC · 34 points · 2 comments · 30 min read · LW link