Lee Sharkey

Karma: 2,125

Goodfire (London). Formerly cofounded Apollo Research.

My main research interests are mechanistic interpretability and inner alignment.

[Linkpost] Interpreting Language Model Parameters

Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors and Lee Sharkey

5 May 2026 17:37 UTC

159 points

2 comments · 2 min read · LW link

(www.goodfire.ai)

[Paper] Stochastic Parameter Decomposition

Lee Sharkey, Lucius Bushnaq and Dan Braun

27 Jun 2025 16:54 UTC

47 points

14 comments · 1 min read · LW link

(arxiv.org)

Mech interp is not pre-paradigmatic

Lee Sharkey

10 Jun 2025 13:39 UTC

213 points

15 comments · 13 min read · LW link

Paper: Open Problems in Mechanistic Interpretability

Lee Sharkey and bilalchughtai

29 Jan 2025 10:25 UTC

71 points

0 comments · 1 min read · LW link

(arxiv.org)

Attribution-based parameter decomposition

Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel and Lee Sharkey

25 Jan 2025 13:12 UTC

108 points

21 comments · 4 min read · LW link

(publications.apolloresearch.ai)

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey and Neel Nanda

24 Aug 2024 0:56 UTC

73 points

10 comments · 20 min read · LW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Kola Ayonrinde, Michael Pearce and Lee Sharkey

23 Aug 2024 18:52 UTC

43 points

8 comments · 16 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill

18 Jul 2024 14:15 UTC

125 points

18 comments · 18 min read · LW link

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

keith_wynroe and Lee Sharkey

2 Jul 2024 13:17 UTC

86 points

7 comments · 12 min read · LW link

Apollo Research 1-year update

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke and rusheb

29 May 2024 17:44 UTC

93 points

0 comments · 7 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill and Lee Sharkey

17 May 2024 16:25 UTC

57 points

20 comments · 4 min read · LW link

(arxiv.org)

Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition

cmathw, Dennis Akar and Lee Sharkey

8 Apr 2024 11:14 UTC

42 points

4 comments · 15 min read · LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey

3 Apr 2024 12:34 UTC

96 points

23 comments · 22 min read · LW link

Addressing Feature Suppression in SAEs

Benjamin Wright and Lee Sharkey

16 Feb 2024 18:32 UTC

87 points

5 comments · 10 min read · LW link

Theories of Change for AI Auditing

Lee Sharkey, beren and Marius Hobbhahn

13 Nov 2023 19:33 UTC

54 points

0 comments · 18 min read · LW link

(www.apolloresearch.ai)

Announcing Apollo Research

Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni and Jérémy Scheurer

30 May 2023 16:17 UTC

222 points

11 comments · 8 min read · LW link

‘Fundamental’ vs ‘applied’ mechanistic interpretability research

Lee Sharkey

23 May 2023 18:26 UTC

65 points

6 comments · 3 min read · LW link

A technical note on bilinear layers for interpretability

Lee Sharkey

8 May 2023 6:06 UTC

59 points

0 comments · 1 min read · LW link

(arxiv.org)

A small update to the Sparse Coding interim research report

Lee Sharkey, Dan Braun and beren

30 Apr 2023 19:54 UTC

61 points

5 comments · 1 min read · LW link

Why almost every RL agent does learned optimization

Lee Sharkey

12 Feb 2023 4:58 UTC

32 points

3 comments · 5 min read · LW link