I think it’s the other way around. If you try to implement computation in superposition in a network with a residual stream, you will find that about the best thing you can do with W_out is often to just use it as a global linear transformation. Most other things you might try to do with it drastically increase noise for not much pay-off. In the cases where networks are doing that, I would want SPD to show us this global linear transform.
What do you mean by “a global linear transformation”? As in, what kinds of linear transformations are there other than this? If we have an MLP in which multiple computations are going on in superposition (in your sense), I would hope that the W_in would be decomposed into co-activating subcomponents corresponding to features being read into computations, and that the W_out would also be decomposed into co-activating subcomponents corresponding to the outputs of those computations being read back into the residual stream. The fact that this doesn’t happen tells me something is wrong.
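A minimal sketch of the kind of decomposition I’d hope for (a toy numpy construction; the dimensions, the random directions, and the strictly rank-1 structure are illustrative assumptions, not anything SPD is claimed to produce):

```python
import numpy as np

rng = np.random.default_rng(0)
d_resid, d_mlp, n_circuits = 64, 32, 10   # illustrative sizes

# Each circuit c reads one feature direction f[c] from the residual stream into a
# neuron direction n[c], and writes an output direction o[c] back to the residual stream.
f = rng.normal(size=(n_circuits, d_resid))
f /= np.linalg.norm(f, axis=1, keepdims=True)
n = rng.normal(size=(n_circuits, d_mlp))
n /= np.linalg.norm(n, axis=1, keepdims=True)
o = rng.normal(size=(n_circuits, d_resid))
o /= np.linalg.norm(o, axis=1, keepdims=True)

# The hoped-for decomposition: each weight matrix is a sum of rank-1 subcomponents,
# one per circuit, with subcomponent c of W_in co-activating with subcomponent c of W_out.
W_in_subcomponents = [np.outer(n[c], f[c]) for c in range(n_circuits)]    # (d_mlp, d_resid)
W_out_subcomponents = [np.outer(o[c], n[c]) for c in range(n_circuits)]   # (d_resid, d_mlp)

W_in = sum(W_in_subcomponents)
W_out = sum(W_out_subcomponents)
```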
As to the issue with the maximum number of components: it seems to me like if you have five sparse features (in something like the SAE sense) in superposition and you apply a rotation (or reflection, or identity transformation), then the important information would be contained in a set of five rank-1 transformations, basically a set of maps from A to B. This doesn’t happen for the identity; does it happen for a rotation or reflection?
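To make the premise concrete: if the map only matters on the span of the five features, it can be written as five rank-1 maps, one per feature. A toy numpy sketch (the dimensions and the random rotation are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5                        # residual dim and number of sparse features (illustrative)

A = rng.normal(size=(d, k))         # embedding directions of the five features (columns)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # an arbitrary rotation of the whole space

# Read-off vectors: row i of pinv(A) extracts feature i's coefficient from an activation.
R = np.linalg.pinv(A)               # shape (k, d)

# Sum of five rank-1 maps, each sending feature i's direction to its rotated image.
W = sum(np.outer(Q @ A[:, i], R[i]) for i in range(k))   # shape (d, d), rank at most 5

# On any activation built from the five features, W agrees with the full rotation Q.
x = A @ rng.normal(size=k)
print(np.allclose(W @ x, Q @ x))    # True
```

So at most five rank-1 pieces carry all the information that matters; the question is whether a decomposition method would actually surface them for a rotation or reflection.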
Finally, as to “introducing noise” by doing things other than a global linear transformation: where have you seen evidence for this? On synthetic (and thus clean) datasets, or actually on real datasets? In real scenarios, your model will (I strongly believe) be set up such that the “noise” between interfering features is actually helpful for model performance, since the world has lots of structure which can be captured in the particular arrangement in which you embed your overcomplete feature set into a lower-dimensional space.
What do you mean by “a global linear transformation”? As in, what kinds of linear transformations are there other than this? If we have an MLP in which multiple computations are going on in superposition (in your sense), I would hope that the W_in would be decomposed into co-activating subcomponents corresponding to features being read into computations, and that the W_out would also be decomposed into co-activating subcomponents corresponding to the outputs of those computations being read back into the residual stream. The fact that this doesn’t happen tells me something is wrong.
Linear transformations that are the sum of weights for different circuits in superposition, for example.
What I am trying to say is that I expect networks to implement computation in superposition by linearly adding many different subcomponents to create W_in, but I mostly do not expect networks to create W_out by linearly adding many different subcomponents that each read out a particular circuit’s output back into the residual stream, because that’s actually an incredibly noisy operation. I made this mistake at first as well. This post still has a faulty construction for W_out because of my error. Linda Linsefors finally corrected me on this a couple of months ago.
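To illustrate where the noise comes from, here is a toy numpy sketch (a deliberately simplified setup with random directions, not the construction from this post): when W_out is a sum of per-circuit rank-1 write-outs and the hidden directions are overcomplete, each write-out also responds to every other circuit’s hidden direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_mlp, d_resid, m = 50, 100, 400    # hidden dim, residual dim, number of circuits

# One hidden direction and one output direction per circuit; m >> d_mlp forces the
# hidden directions into superposition (they cannot all be orthogonal).
H = rng.normal(size=(d_mlp, m))
H /= np.linalg.norm(H, axis=0)
O = rng.normal(size=(d_resid, m))
O /= np.linalg.norm(O, axis=0)

# W_out built as a sum of per-circuit rank-1 write-outs, each trying to read its own
# circuit's hidden direction and write that circuit's output direction.
W_out = O @ H.T                     # equals sum_c outer(O[:, c], H[:, c])

# Suppose only circuit 0 is active, so the hidden state is exactly its direction.
hidden = H[:, 0]
out = W_out @ hidden

signal = O[:, 0] * (H[:, 0] @ H[:, 0])   # intended contribution (H[:, 0] is unit norm)
noise = out - signal                      # interference from the other m - 1 write-outs
print(np.linalg.norm(noise) / np.linalg.norm(signal))
```

With many more circuits than hidden dimensions, the printed noise-to-signal ratio comes out well above 1, i.e. the interference swamps the intended write-out.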
As to the issue with the maximum number of components: it seems to me like if you have five sparse features (in something like the SAE sense) in superposition and you apply a rotation (or reflection, or identity transformation), then the important information would be contained in a set of five rank-1 transformations, basically a set of maps from A to B. This doesn’t happen for the identity; does it happen for a rotation or reflection?
I disagree that, if all we’re doing is applying a linear transformation to the entire space of superposed features rather than, say, performing different computations on the five different features, it would be desirable to split this linear transformation into the five features.
Finally, as to “introducing noise” by doing things other than a global linear transformation: where have you seen evidence for this? On synthetic (and thus clean) datasets, or actually on real datasets? In real scenarios, your model will (I strongly believe) be set up such that the “noise” between interfering features is actually helpful for model performance, since the world has lots of structure which can be captured in the particular arrangement in which you embed your overcomplete feature set into a lower-dimensional space.
Uh, I think this would be a longer discussion than I feel up for at the moment, but I disagree with your prediction. I agree that the representational geometry in the model will be important and that it will be set up to help the model, but interference of circuits in superposition cannot be arranged to be helpful in full generality. If it were, I would take that as pretty strong evidence that whatever is going on in the model is not well-described by the framework of superposition at all.