ryan_greenblatt

Karma: 6,265

I work at Redwood Research.

How useful is “AI Control” as a framing on AI X-Risk?

14 Mar 2024 18:06 UTC
67 points
4 comments · 34 min read · LW link

Notes on control evaluations for safety cases

28 Feb 2024 16:15 UTC
32 points
0 comments · 32 min read · LW link

Preventing model exfiltration with upload limits

ryan_greenblatt · 6 Feb 2024 16:29 UTC
63 points
16 comments · 14 min read · LW link

The case for ensuring that powerful AIs are controlled

24 Jan 2024 16:11 UTC
245 points
66 comments · 28 min read · LW link

Managing catastrophic misuse without robust AIs

16 Jan 2024 17:27 UTC
58 points
16 comments · 11 min read · LW link

Catching AIs red-handed

5 Jan 2024 17:43 UTC
82 points
18 comments · 17 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
56 points
10 comments · 4 min read · LW link

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

16 Dec 2023 5:49 UTC
73 points
3 comments · 6 min read · LW link

AI Control: Improving Safety Despite Intentional Subversion

13 Dec 2023 15:51 UTC
197 points
7 comments · 10 min read · LW link

Auditing failures vs concentrated failures

11 Dec 2023 2:47 UTC
44 points
0 comments · 7 min read · LW link

How useful is mechanistic interpretability?

1 Dec 2023 2:54 UTC
156 points
53 comments · 25 min read · LW link

Preventing Language Models from hiding their reasoning

31 Oct 2023 14:34 UTC
107 points
12 comments · 12 min read · LW link

ryan_greenblatt’s Shortform

ryan_greenblatt · 30 Oct 2023 16:51 UTC
6 points
33 comments · 1 min read · LW link

Improving the Welfare of AIs: A Nearcasted Proposal

ryan_greenblatt · 30 Oct 2023 14:51 UTC
87 points
5 comments · 20 min read · LW link

What’s up with “Responsible Scaling Policies”?

29 Oct 2023 4:17 UTC
99 points
8 comments · 20 min read · LW link

Benchmarks for Detecting Measurement Tampering [Redwood Research]

5 Sep 2023 16:44 UTC
84 points
18 comments · 20 min read · LW link
(arxiv.org)

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

26 Jul 2023 17:02 UTC
83 points
18 comments · 1 min read · LW link

Two problems with ‘Simulators’ as a frame

ryan_greenblatt · 17 Feb 2023 23:34 UTC
81 points
13 comments · 5 min read · LW link

Causal scrubbing: results on induction heads

3 Dec 2022 0:59 UTC
34 points
1 comment · 17 min read · LW link

Causal scrubbing: results on a paren balance checker

3 Dec 2022 0:59 UTC
34 points
2 comments · 30 min read · LW link