Cool work!
My current understanding is that this part of the tech tree centrally relies on some kind of token-wise sparsity—the components are built such that they have an impact only on a small subset of tokens.
Is that correct? Or does it play a sufficiently small role that it could be ablated away?
Is that bad? My guess is that the most important circuits to interpret are useful on many tokens. Circuits related to a deceptive character trying not to get caught, or to consequentialist reasoning that steers the story towards a happy outcome, may be somewhat sparse in pretraining (though maybe more like 1/10k tokens than 1/10M tokens), but active much more frequently in post-training if they were powering scary deception.
Is there a way out? It’s not clear to me that there is one, because things don’t seem very sparse in Transformers. But I would also not be surprised if there were satisfying, sparse, human-understandable full explanations of small Transformers, because the reality that Transformers are trying to capture is probably explainable in sparse terms.
Well, it’s not like the method can’t find components that are causally important at many sequence positions. E.g. we show how you can capture the generic previous-token QK behaviour in an attention layer using just two rank-1 subcomponents, one in the query matrix and one in the key matrix. And, as you might expect from such a generic behaviour, those two are used on pretty much every token.
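(To spell out what "used on pretty much every token" means for a rank-1 pair like that, here's a rough NumPy sketch, not our actual code; the shapes and the random data standing in for real residual-stream activations are placeholder assumptions, and positional encodings are ignored. With rank-1 query and key factors, each token's contribution to the QK score collapses to one scalar projection per side, so per-token usage is just how often those projections are non-negligible.)

```python
import numpy as np

# Illustrative sketch only: how often is a rank-1 QK subcomponent pair "used"
# across sequence positions? All shapes and data below are placeholders.
rng = np.random.default_rng(0)
d_model, n_tokens = 64, 512

# Stand-in for residual-stream activations feeding the attention layer
# (in practice these would come from running the model on real prompts).
resid = rng.normal(size=(n_tokens, d_model))

# A rank-1 subcomponent of the query matrix is an outer product u_q v_q^T,
# and likewise u_k v_k^T for the key matrix. v_q / v_k are the input
# directions the subcomponents read from.
v_q = rng.normal(size=d_model); v_q /= np.linalg.norm(v_q)
v_k = rng.normal(size=d_model); v_k /= np.linalg.norm(v_k)

# With rank-1 factors (and ignoring positional encoding), the QK score
# factorises: score(i, j) is proportional to (resid[i] @ v_q) * (resid[j] @ v_k),
# so each token contributes just one scalar per side.
q_proj = resid @ v_q
k_proj = resid @ v_k

def usage_fraction(proj, rel_threshold=0.1):
    """Fraction of positions where the projection is non-negligible."""
    return np.mean(np.abs(proj) > rel_threshold * np.abs(proj).max())

# For a generic behaviour like previous-token attention, we'd expect these
# fractions to be near 1; for a token-wise sparse component, near 0.
print("query-side usage:", usage_fraction(q_proj))
print("key-side usage:  ", usage_fraction(k_proj))
```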
I guess if there are lots of circuits in the same layer that are all used on literally every sequence position of every prompt, this method would have trouble teasing those circuits apart from each other. But as soon as the circuits involved aren’t used basically all of the time on all data, it gets a lot more manageable. Practically speaking, I don’t know if the method could currently separate two circuits in a model that are both active on a large fraction of all tokens in the dataset under realistic conditions, but in theory it should be able to. Doing so would lower the training loss.[1] It’d just be about making the optimisation work well enough.
[1] Unless the circuits are also perfectly correlated in when they’re used.