scasper (Stephen Casper)
The 6D effect: When companies take risks, one email can be very powerful.
Deep Forgetting & Unlearning for Safely-Scoped LLMs
[Linkpost] A survey on over 300 works about interpretability in deep networks
Takeaways from the Mechanistic Interpretability Challenges
Analogies between scaling labs and misaligned superintelligent AI
Open Problems and Fundamental Limitations of RLHF
EIS V: Blind Spots In AI Safety Interpretability Research
EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety
The Engineer’s Interpretability Sequence (EIS) I: Intro
Eight Strategies for Tackling the Hard Part of the Alignment Problem
Existential AI Safety is NOT separate from near-term applications
EIS VII: A Challenge for Mechanists
Dissolving Confusion around Functional Decision Theory
Where to be an AI Safety Professor
EIS IX: Interpretability and Adversaries
Deep Dives: My Advice for Pursuing Work in Research
EIS II: What is “Interpretability”?
I get the impression of a certain motte-and-bailey dynamic in this comment and similar arguments. From a high level, the notion of better understanding what neural networks are doing is great. The problem, though, is that most state-of-the-art interpretability research does not seem to be doing a good job of this in a way that will be useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
My answer to this is actually tucked into one paragraph on the 10th page of the paper: “This type of approach is valuable...reverse engineering a system”. We cite examples of papers that have used interpretability tools to generate novel adversaries, to aid in manually fine-tuning a network to induce a predictable change, or to reverse engineer a network. Here they are (with a toy sketch of the adversary-generation idea after the links).
Making adversaries:
https://distill.pub/2019/activation-atlas/
https://arxiv.org/abs/2110.03605
https://arxiv.org/abs/1811.12231
https://arxiv.org/abs/2201.11114
https://arxiv.org/abs/2206.14754
https://arxiv.org/abs/2106.03805
https://arxiv.org/abs/2006.14032
https://arxiv.org/abs/2208.08831
https://arxiv.org/abs/2205.01663
Manual fine-tuning:
https://arxiv.org/abs/2202.05262
https://arxiv.org/abs/2105.04857
Reverse engineering (I’d put an asterisk on these, though, because I don’t expect methods like this to scale well beyond toy problems):
https://distill.pub/2020/circuits/curve-detectors/
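To make the first category concrete, here is a minimal sketch of the adversary-generation idea: pick a unit inside the network, identified via some interpretability method, and run gradient ascent on a small input perturbation to excite it. This is my own illustration, not code from any of the papers above; the model, layer, channel index, and perturbation budget are all arbitrary assumptions for the example.

```python
# Toy sketch: use an interpretability signal (a chosen internal channel's
# activation) to guide a gradient-based search for an adversarial input.
# Requires torchvision >= 0.13 for the `weights=` API.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

# Hook an intermediate layer so we can read its activations on each forward pass.
acts = {}
def hook(_module, _inputs, output):
    acts["layer"] = output
model.layer3.register_forward_hook(hook)

x = torch.rand(1, 3, 224, 224)           # stand-in for a natural image
delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

UNIT = 7                                  # hypothetical channel to excite
for _ in range(100):
    opt.zero_grad()
    model((x + delta).clamp(0, 1))
    # Maximize the chosen channel's mean activation (minimize its negation).
    loss = -acts["layer"][0, UNIT].mean()
    loss.backward()
    opt.step()
    with torch.no_grad():
        # Keep the perturbation inside a small L-infinity ball.
        delta.clamp_(-8 / 255, 8 / 255)

adversary = (x + delta).clamp(0, 1).detach()
```

The same loop with a different objective (say, an output logit instead of an internal channel) is just a standard gradient-based attack; the interpretability angle is that the target was chosen by looking inside the model.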
There seems to be high variance in the scope of the challenges that Katja has been tackling recently.