Is weight linearity real?
A core assumption of linear parameter decomposition methods (APD, SPD) is weight linearity. These methods attempt to decompose a neural network's parameter vector into a sum of components $\theta = \sum_c \theta_c$ such that each component is sufficient to execute the mechanism it implements.[1] That this is possible is a crucial and unusual assumption. As a counter-intuition, consider Transcoders: they decompose a 768×3072 matrix into 24,576 components of shape 768×1, which would sum to a much larger matrix than the original.[2]
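To make the claim concrete, here is a minimal sketch of what weight linearity asks for (not the APD/SPD algorithm itself, just an illustration with made-up numbers): the full weights are the sum of component weights, and on an input whose mechanism lives in one component, that component's weights alone reproduce the original output.

```python
import numpy as np

def forward(theta, x):
    # Toy one-layer ReLU network; theta is a 2x2 weight matrix, x a 2-vector.
    return np.maximum(theta @ x, 0.0)

# Hypothetical neuron-aligned decomposition into two components (one per feature).
theta_1 = np.array([[3.0, 0.0],
                    [0.0, 0.0]])
theta_2 = np.array([[0.0, 0.0],
                    [0.0, 2.0]])
theta = theta_1 + theta_2          # weight linearity: theta = sum_c theta_c

x = np.array([1.0, 0.0])           # input that only exercises mechanism 1
print(forward(theta, x))           # [3. 0.]
print(forward(theta_1, x))         # [3. 0.]  -> component 1 alone suffices
```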
Trivial example where weight linearity does not hold: Consider the matrix $M = \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix}$ in a network that uses superposition to represent 3 features in two dimensions. A sensible decomposition could be to represent the matrix as the sum of 3 rank-one components
$$\hat{v}_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad \hat{v}_2 = \begin{pmatrix} -0.5 \\ 0.866 \end{pmatrix}, \quad \hat{v}_3 = \begin{pmatrix} -0.5 \\ -0.866 \end{pmatrix}.$$

If we do this though, we see that the components sum to more than the original matrix:
$$5\hat{v}_1\hat{v}_1^\top + 5\hat{v}_2\hat{v}_2^\top + 5\hat{v}_3\hat{v}_3^\top = \begin{pmatrix} 5 & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} 1.25 & -2.165 \\ -2.165 & 3.75 \end{pmatrix} + \begin{pmatrix} 1.25 & 2.165 \\ 2.165 & 3.75 \end{pmatrix} = \begin{pmatrix} 7.5 & 0 \\ 0 & 7.5 \end{pmatrix}.$$

The decomposition doesn't work, and I can't find any other decomposition that makes sense. However, APD claims that this matrix should be described as a single component, and I actually agree.[3]
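For the skeptical reader, here is a quick numerical check of the sum above (plain NumPy, using the same values as in the text):

```python
import numpy as np

M = np.array([[5.0, 0.0],
              [0.0, 5.0]])
v1 = np.array([1.0, 0.0])
v2 = np.array([-0.5,  np.sqrt(3) / 2])   # ~ (-0.5,  0.866)
v3 = np.array([-0.5, -np.sqrt(3) / 2])   # ~ (-0.5, -0.866)

components = [5 * np.outer(v, v) for v in (v1, v2, v3)]
print(sum(components))  # [[7.5 0. ] [0.  7.5]] = 1.5 * M, not M
```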
Trivial examples where weight linearity does hold: The SPD/APD papers contain two models where weight linearity holds: the Toy Model of Superposition, and a hand-coded piecewise-linear network. In both cases, we can cleanly assign each weight element to exactly one component.
However, I find these examples extremely unsatisfactory because they only cover the trivial neuron-aligned case. When each neuron is dedicated to exactly one component (monosemantic), parameter decomposition is trivial. In realistic models, we strongly expect neurons to not be monosemantic (superposition, computation in superposition), and we don’t know whether weight linearity holds in those cases.
Intuition in favour of weight linearity: If neurons behave as described in Circuits in Superposition (Bushnaq & Mendel), then I am optimistic about weight linearity. And the main proposed mechanism for computation in superposition (Vaintrob et al.) works like this too. But we have no trained models that we know to behave this way.[4]
Intuition against weight linearity: Think of a general arrangement of multiple inputs feeding into one ReLU neuron. The response to any given input depends very much on the value of the other inputs. Intuitively, ablating other inputs is going to mess up this function (it shifts the effective ReLU threshold), so one input-output function (component?) cannot work independently of the others. Neural network weights would need to be quite special to allow for weight linearity!
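Here is a tiny illustration of that intuition, with made-up weights and inputs (not taken from any of the models discussed above): ablating one input to a ReLU neuron changes the neuron's response to the other input, because it effectively shifts the threshold.

```python
def neuron(x1, x2, w1=1.0, w2=1.0, b=-0.5):
    # One ReLU neuron reading two input features.
    return max(w1 * x1 + w2 * x2 + b, 0.0)

x1 = 0.3
print(neuron(x1, x2=0.4))  # 0.2  (other input active: x1's contribution passes through)
print(neuron(x1, x2=0.0))  # 0.0  (other input ablated: x1 now sits below the effective threshold)
```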
I'm genuinely unsure what the correct answer is. I'd love to see projects (or project ideas) for testing this assumption!
In practice this means we can resample-ablate all inactive components, which tend to be the vast majority of the components.
Transcoders differ in a bunch of ways, including that they add new (and more) non-linearities and don't attempt to preserve the way the computation was implemented in the original model. This is to say, it isn't a tight analogy at all, so don't read too much into it.
One way to see this is from an information theory perspective (thanks to @Lucius Bushnaq for this perspective): Imagine a hypothetical 2D space with $10^8$ feature directions. Describing the 2×2 matrix as $10^8$ individual components requires vastly more bits than the original matrix had.
We used to think that our Compressed Computation toy model is an example of real Computation in Superposition, but have since realized that it probably isn't.
@Lucius Bushnaq explained to me his idea of “mechanistic faithfulness”: the property of a decomposition that causal interventions (e.g. ablations) on the decomposition correspond to interventions on the weights of the original model.[1]
This mechanistic faithfulness implies that the above $\begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix}$ matrix shouldn't be decomposed into $10^8$ individual components (one for every input feature), because there exists no ablation I can make to the weight matrix that corresponds to e.g. ablating just one of the $10^8$ components.
Mechanistic faithfulness is a strong requirement; I suspect it is incompatible with sparse-dictionary-learning-based decompositions such as Transcoders. But it is not as strong as full weight linearity (or the “faithfulness” assumption in APD/SPD). To see this, consider a network with three mechanisms A, B, and C. Mechanistic faithfulness implies there exist weights $\theta_{ABC}$, $\theta_{AB}$, $\theta_{AC}$, $\theta_{BC}$, $\theta_A$, $\theta_B$, and $\theta_C$ that correspond to ablating none, one, or two of the mechanisms. Weight linearity additionally assumes that $\theta_{ABC} = \theta_{AB} + \theta_C = \theta_A + \theta_B + \theta_C$, etc.
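As a hedged sketch of the gap between the two conditions: one could imagine obtaining the per-subset weights that mechanistic faithfulness guarantees and then testing whether they are also additive. The helper below is purely illustrative (nothing like it exists in the APD/SPD codebases as far as I know).

```python
import itertools
import numpy as np

def is_weight_linear(theta_by_subset, mechanisms=("A", "B", "C"), atol=1e-6):
    """theta_by_subset maps frozensets of *unablated* mechanisms to weight vectors."""
    singles = {m: theta_by_subset[frozenset({m})] for m in mechanisms}
    for r in range(2, len(mechanisms) + 1):
        for subset in itertools.combinations(mechanisms, r):
            expected = sum(singles[m] for m in subset)  # what weight linearity predicts
            if not np.allclose(theta_by_subset[frozenset(subset)], expected, atol=atol):
                return False
    return True
```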
Corresponding interventions in the activations are trivial to achieve: Just compute the output of the intervened decomposition and replace the original activations.