ryan_greenblatt

Karma: 6,143

I work at Redwood Research.

The case for ensuring that powerful AIs are controlled

ryan_greenblatt and Buck

24 Jan 2024 16:11 UTC

245 points

66 comments · 28 min read · LW link

How useful is mechanistic interpretability?

ryan_greenblatt, Neel Nanda, Buck and habryka

1 Dec 2023 2:54 UTC

155 points

53 comments · 25 min read · LW link

Improving the Welfare of AIs: A Nearcasted Proposal

ryan_greenblatt

30 Oct 2023 14:51 UTC

87 points

5 comments · 20 min read · LW link

Benchmarks for Detecting Measurement Tampering [Redwood Research]

ryan_greenblatt and Fabien Roger

5 Sep 2023 16:44 UTC

85 points

18 comments · 20 min read · LW link

(arxiv.org)

Catching AIs red-handed

ryan_greenblatt and Buck

5 Jan 2024 17:43 UTC

82 points

18 comments · 17 min read · LW link

Two problems with ‘Simulators’ as a frame

ryan_greenblatt

17 Feb 2023 23:34 UTC

81 points

13 comments · 5 min read · LW link

Preventing model exfiltration with upload limits

ryan_greenblatt

6 Feb 2024 16:29 UTC

63 points

16 comments · 14 min read · LW link

Managing catastrophic misuse without robust AIs

ryan_greenblatt and Buck

16 Jan 2024 17:27 UTC

58 points

16 comments · 11 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

ryan_greenblatt, Fabien Roger and Buck

23 Dec 2023 0:05 UTC

56 points

10 comments · 4 min read · LW link

Auditing failures vs concentrated failures

ryan_greenblatt and Fabien Roger

11 Dec 2023 2:47 UTC

44 points

0 comments · 7 min read · LW link

Notes on control evaluations for safety cases

ryan_greenblatt, Buck and Fabien Roger

28 Feb 2024 16:15 UTC

32 points

0 comments · 32 min read · LW link

Large corporations can unilaterally ban/tax ransomware payments via bets

ryan_greenblatt

17 Jul 2021 12:56 UTC

26 points

5 comments · 2 min read · LW link

Researcher incentives cause smoother progress on benchmarks

ryan_greenblatt

21 Dec 2021 4:13 UTC

20 points

4 comments · 1 min read · LW link

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt

15 Dec 2021 19:06 UTC

16 points

15 comments · 27 min read · LW link

Naive self-supervised approaches to truthful AI

ryan_greenblatt

23 Oct 2021 13:03 UTC

9 points

4 comments · 2 min read · LW link

[Question] Questions about multivitamins, especially manganese

ryan_greenblatt

19 Jun 2021 16:09 UTC

7 points

8 comments · 1 min read · LW link

Potential gears level explanations of smooth progress

ryan_greenblatt

22 Dec 2021 18:05 UTC

4 points

2 comments · 2 min read · LW link