Lee Sharkey

Karma: 2,139

Goodfire (London). Formerly cofounded Apollo Research.

My main research interests are mechanistic interpretability and inner alignment.

[Linkpost] Interpreting Language Model Parameters

Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors and Lee Sharkey

5 May 2026 17:37 UTC

162 points

2 comments2 min readLW link

(www.goodfire.ai)

[Paper] Stochastic Parameter Decomposition

Lee Sharkey, Lucius Bushnaq and Dan Braun

27 Jun 2025 16:54 UTC

47 points

14 comments1 min readLW link

(arxiv.org)

Lee Sharkey 14 Jun 2025 19:39 UTC
LW: 5 AF: 3
0
AF
in reply to: Adam Shai’s comment on: Mech interp is not pre-paradigmatic
Hey Adam, thanks for your thoughts on this!

I strongly agree with this, and I hope more people in mech. interp. become aware of this. I would actually emphasize that in my opinion it’s not just that it’s hard to do computational science on brains, but that we don’t have the right framework. Some weak evidence for this is exactly that we have an intelligent system that has existed for a few years now where experiments and analyses are easy to do, and we can see how far we’ve gotten with the CNC approach.

I think we’re on the same page that we might not have the right framework to do computational science on brains or other intelligent systems. I think we might disagree on how far away current mainstream ideas are from being the right framework—I’d predict that, if we talked it out further, I’d say we’re closer than you’d say we are. I don’t know how far afield from current ideas that we need to look for the right framework, and I’d support work that looks even further afield than several inferential steps from current mainstream ideas. But I don’t think the historical sluggish pace of computational neuroscience justifies search any particular inferential distance; more proximal solutions feel just as likely to be the next paradigm/wave than more distant solutions (maybe more likely given the social nature of what constitutes a paradigm/wave).
My main point of confusion with this post has to do with Parameter Decomposition as a new paradigm.
I really want to re-emphasize that I didn’t call PD a new paradigm (or even a new ‘wave’) in the post. N.B.: “I’ll emphasize that these are early ideas and certainly do not yet constitute ‘Third-Wave Mech Interp’. ”
I haven’t thought about this technique much, but on a first reading it doesn’t sound all that different from what you call the second wave paradigm, just replacing activations with parameters. For instance, I think I could take most of the last few sections of this post and rewrite it to make the point. Just for fun I’ll try this out here, trying to argue for a new paradigm called “Activation Decomposition”. (just to be super clear I don’t think this is a new paradigm!)
Yeah I don’t think PD throws away the majority of the ideas in the 2nd wave. It’s designed primarily to solve the anomallies of the 2nd wave. It will therefore resemble 2nd wave ideas and we can draw analogies. But I think it’s different in important ways. For one, I think it will probably help us be less confused about ideas like ‘feature’, ‘representation’, ‘circuit’, and so on.

> Some work in this early period was quite neuroscience-adjacent; as a result, despite being extremely mechanistic in flavour, some of this work may have been somewhat overlooked e.g. Sussillo and Barak (2013).
This is a nitpick, and I don’t think any of your main points rests on this, but I think the main reason this work was not used in any type of artificial neural network interp work at that time was that it is fundamentally only applicable to recurrent systems, and probably impossible to apply to e.g. standard convolutional networks. It’s not even straightforward to apply to a lot of the types of recurrent systems used in AI today (to the extent they are even used), but probably one could push on that a bit with some effort.
Yes this is fair. These are still fairly deep neural networks, though (if we count time as depth), and they’re examples of work that interprets ANNs on the lowest level using low-level analysis of weights and activations using e.g. dimensionality reduction and other methods mech interp folks might find familiar. But I agree it doesn’t usually get put in the bucket of ‘mech interp’ though ultimately the boundary is fairly arbitrary. As a separate point, it is surprising how little of the neuroscience community has actually jumped onto mechanistically understanding more interesting models like inception v2 or LLMs despite the similarity of methods and object of study, which is a testament to the early mech interp pioneers since they saw a field where few others did.
As a final question, I am wondering what you think the implications for what people should be doing are if mech interp is or is not pre-paradigmatic? Is there a difference between mech interp being in a not-so-great-paradigm vs. pre-paradigmatic in terms of what your median researcher should be thinking/doing/spending time on? Or is this just an intellectually interesting thing to think about. I am guessing that when a lot of people say that mech interp is pre paradigmatic they really mean something closer to “mech interp doesn’t have a useful/good/perfect paradigm right now”. But I’m also not sure if there’s anything here beyond semantics
I’m not actually sure if this is very action-relevant. I think in the past I might have said “mech interp practitioners should be more familiar with computational neuroscience/connectionism”, since I think this might have save the mech interp community some time. But I don’t think it would have saved a huge amount of time, and I think the mech interp has largely surpassed comp neuro as a source of interesting and relevant ideas. I think it’s mostly useful as an exercise in situating mech interp ideas within the broader set of ideas of an eminently related field (comp neuro/connectionism). But I’ll stress that many in the field see mech interp as better contextualized by other sets of broader ideas (e.g. as a subfield of interpretability/ML), and when viewing mech interp in light of those ideas, it might better be thought of as pre-paradigmatic. I think that’s a completely compatible but different perspective from the one I tend to take, and just emphasizes the subjectiveness of the whole question of whether the field is paradigmatic or not.

Mech interp is not pre-paradigmatic

Lee Sharkey10 Jun 2025 13:39 UTC

213 points

15 comments13 min readLW link

Lee Sharkey 11 Feb 2025 13:28 UTC
4 points
0
in reply to: Danielle Ensign’s comment on: Attribution-based parameter decomposition
Figure 3 in this paper (AtP*, Kramar et al.) illustrates the point nicely: https://arxiv.org/abs/2403.00745

Paper: Open Problems in Mechanistic Interpretability

Lee Sharkey and bilalchughtai

29 Jan 2025 10:25 UTC

71 points

0 comments1 min readLW link

(arxiv.org)

Lee Sharkey 26 Jan 2025 20:56 UTC
LW: 4 AF: 1
0
AF
in reply to: Oliver Daniels’s comment on: Attribution-based parameter decomposition
seems great for mechanistic anomaly detection! very intuitive to map ADP to surprise accounting (I was vaguely trying to get at a method like ADP here)
Agree! I’d be excited by work that uses APD for MAD, or even just work that applies APD to Boolean circuit networks. We did consider using them as a toy model at various points, but ultimately opted to go for other toy models instead.

(btw typo: *APD)

Lee Sharkey 26 Jan 2025 20:56 UTC
LW: 2 AF: 1
1
AF
in reply to: Oliver Daniels’s comment on: Attribution-based parameter decomposition
IMO most exciting mech-interp research since SAEs, great work
I think so too! (assuming it can be made more robust and scaled, which I think it can)
And thanks! :)

Lee Sharkey 26 Jan 2025 18:46 UTC
LW: 4 AF: 3
0
AF
in reply to: Aidan Ewart’s comment on: Attribution-based parameter decomposition
We’re aware of model diffing work like this, but I wasn’t aware of this particular paper.
It’s probably an edge case: They do happen both to be in weight space and to be suggestive of weight space linearity. Indeed, our work was informed by various observations from a range of areas that suggest weight space linearity (some listed here).
On the other hand, our work focused on decomposing a given network’s parameters. But the line of work you linked above seems more in pursuit of model editing and understanding the difference between two similar models, rather than decomposing a particular model’s weights.
all in all, whether it deserved to be in the related work section is unclear to me. Seems plausible either way. The related work section was already pretty long, but it maybe deserves a section on weight space linearity, though probably not one on model diffing imo.

Attribution-based parameter decomposition

Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel and Lee Sharkey

25 Jan 2025 13:12 UTC

109 points

21 comments4 min readLW link

(publications.apolloresearch.ai)

Lee Sharkey 24 Aug 2024 12:29 UTC
LW: 11 AF: 8
0
AF
in reply to: Charlie Steiner’s comment on: Showing SAE Latents Are Not Atomic Using Meta-SAEs
It would be interesting to meditate in the question “What kind of training procedure could you use to get a meta-SAE directly?” And I think answering this relies in part on mathematical specification of what you want.
At Apollo we’re currently working on something that we think will achieve this. Hopefully will have an idea and a few early results (toy models only) to share soon.

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey and Neel Nanda

24 Aug 2024 0:56 UTC

73 points

10 comments20 min readLW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Kola Ayonrinde, Michael Pearce and Lee Sharkey

23 Aug 2024 18:52 UTC

43 points

8 comments16 min readLW link

Lee Sharkey 11 Aug 2024 13:26 UTC
LW: 2 AF: 1
0
AF
in reply to: Joe Collman’s comment on: Circumventing interpretability: How to defeat mind-readers
So I believe I had in mind “active means [is achieved deliberately through the agent’s actions]”.

I think your distinction makes sense. And if I ever end up updating this article I would consider incorporating it. However, I think the reason I didn’t make this distinction at the time is because the difference is pretty subtle.

The mechanisms I labelled as “strictly active” are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations to help orchestrate the actions required to do it. This is true even if they’ve been selected for passively.

So I’d argue that they all need to be implemented actively (if they’re to have a reasonable probability of success) but may be selected for passively or actively. I’m curious if you agree with this? If not, then I may have missed your argument.

(
For the convenience of other readers of this thread, the kind of strategies I labelled as strictly active are:
- 1.1.3. Environmental artifacts that corrupt its future self: A misaligned AI may leave behind artifacts that serve as adversarial examples or poisoned data that corrupt its future self. (Strictly active)
- 1.1.4. Deleting (and relearning) dangerous representations: When not under surveillance, an AI might be able to delete dangerous representations so that it looks safe when surveillance resumes. It might also be able to schedule when it relearns those representations. (Strictly active)
- 1.1.5. Weights poisoning: Instead of providing manipulated inputs, the AI might manipulate its own weights so that a small subset of normal inputs behave like adversarial examples. (Strictly active)
- 1.1.6. Pre-committing to following potentially dangerous instructions: Hubinger: ”… if the model has the ability to read and write to some sort of state … it could write out a pre-commitment to follow the hard-coded policy and then just have a policy of always following any pre-commitments it finds such that it never thinks about the deception that led it to that policy in the first place.” (Strictly active)
- 1.1.7. Hard-coded neural shortcuts: Hubinger again: “… it could hack its own gradient to produce a new model which will still exhibit the same deceptive behavior, but without doing any of the detectable deceptive cognition. … not because it’s trying to act deceptive, but just because it’s a hard-coded part of its algorithm” (Strictly active)
- 1.1.8. Following cached thoughts: Nanda: “[An AI might develop] the instinct to follow cached thoughts without thinking about them, and in occasional moments of deception to generate a lot of cached thoughts that will lead to misaligned actions, but with no deceptive cognition in the moment.” (Strictly active)
- 1.1.9. Instrumental incapability: A misaligned AI that’s able to modify its own parameters could damage capabilities that serve some goals (e.g. aligned goals) more than its misaligned goals. Generalizes instrumental clumsiness. (Strictly active)
- 1.1.10. Instrumental clumsiness: An AI might hack its future self so that it is more likely to make certain kinds of mistakes while leaving other capabilities intact. (Strictly active)
- 1.3.3. Interpretability illusions: By manipulating the dataset shown to the tool, an AI can control its interpretations. (Strictly active)
- 1.4.1. Adversarial attack against networks in the interpretation tool: The AI could adversarially attack networks in the tool to manipulate the tool’s outputs. (Strictly active)
- 2.4.1. Manipulating, mind-hacking, or incapacitating the interpreter: An AI might be able to manipulate, directly control, or incapacitate humans using specific kinds of outputs. (Strictly active)
)

Lee Sharkey 26 Jul 2024 14:49 UTC
LW: 4 AF: 2
2
AF
on: Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Extremely glad to see this! The Guez et al. model has long struck me as one of the best instances of a mesaoptimizer and it was a real shame that it was closed source. Looking forward to the interp findings!

Lee Sharkey 26 Jul 2024 11:06 UTC
3 points
0
in reply to: phenomanon’s comment on: Efficient Dictionary Learning with Switch Sparse Autoencoders
Both of these seem like interesting directions (I had parameters in mind, but params and activations are too closely linked to ignore one or the other). And I don’t have a super clear idea but something like representational similarity analysis between SwitchSAEs and regular SAEs could be interesting. This is just one possibility of many though. I haven’t thought about it for long enough to be able to list many more, but it feels like a direction with low hanging fruit for sure. For papers, here’s a good place to start for RSA: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3730178/

Lee Sharkey 23 Jul 2024 10:46 UTC
5 points
0
on: Efficient Dictionary Learning with Switch Sparse Autoencoders
Great work! Very excited to see work in this direction (In fact, I didn’t know you were working on this, so I’d expressed enthusiasm for MoE SAEs in our recent list of project ideas published just a few days ago!)

Comments:
- I’d love to see some geometric analysis of the router. Is it just approximately a down-projection from the encoder features learned by a dense SAE trained on the same activations?
- Consider integrating with SAELens.
Following Fedus et al., we route to a single expert SAE. It is possible that selecting several experts will improve performance. The computational cost will scale with the number of experts chosen.
- If there are some very common features in particular layers (e.g. an ‘attend to BOS’ feature), then restricting one expert to be active at a time will potentially force SAEs to learn common features in every expert.

Lee Sharkey 18 Jul 2024 14:42 UTC
LW: 2 AF: 1
0
AF
in reply to: Mateusz Bagiński’s comment on: A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
Thanks! Fixed now

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill

18 Jul 2024 14:15 UTC

125 points

18 comments18 min readLW link

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

keith_wynroe and Lee Sharkey

2 Jul 2024 13:17 UTC

87 points

7 comments12 min readLW link