Well, it’s not like the method can’t find components that are causally important on many sequence positions. E.g. we show that the generic QK previous-token behaviour in an attention layer can be captured with just two rank-1 subcomponents, one in the query matrix and one in the key matrix. And as you might expect for such a generic behaviour, those two are used on pretty much every token.
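To make this concrete, here is a toy sketch (my own illustration, not code from the post) of how two rank-1 directions can implement previous-token attention. It assumes a rotary-style positional encoding, where a single query-side direction and a single key-side direction rotated by position give a QK score that depends only on relative position:

```python
import numpy as np

# Toy sketch (assumed mechanism, not the post's actual code): with a
# rotary-style positional encoding, one rank-1 direction on the query
# side and one on the key side suffice for a previous-token attention
# pattern, because the QK score depends only on relative position.

def rot(theta):
    """2-D rotation matrix by angle theta."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

n, theta = 8, 0.7          # sequence length, rotary frequency (arbitrary)
u = np.array([1.0, 0.0])   # the single query-side direction
v = rot(theta) @ u         # the single key-side direction, offset by one step

# q_i = R(theta*i) u and k_j = R(theta*j) v, so
# q_i . k_j = cos(theta * (j - i + 1)), maximal at j = i - 1.
q = np.stack([rot(theta * i) @ u for i in range(n)])
k = np.stack([rot(theta * j) @ v for j in range(n)])
scores = q @ k.T
scores[np.triu_indices(n, 1)] = -np.inf              # causal mask
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)

# Every query position (after the first) attends most to the previous token.
for i in range(1, n):
    assert attn[i].argmax() == i - 1
```

The point is just that a behaviour active on every token can still be very low-rank: one direction per weight matrix is enough here.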
I guess if there are lots of circuits in the same layer that are all used on literally every sequence position of every prompt, this method would have trouble teasing those circuits apart from each other. But as soon as the circuits involved aren’t used basically all of the time on all data, it gets a lot more manageable. Practically speaking, I don’t know if the method could currently separate two circuits in a model that are both active on all tokens in the dataset under realistic conditions, but in theory it should be able to, since doing so would lower the training loss.[1] It’d just be a matter of making the optimisation work well enough.
[1] Unless the circuits are also perfectly correlated in when they’re used.