Being linearly independent is sufficient in this case to read off each x_i with zero interference. Rank-one matrices are equivalent to (linear functional) * vector, so we just pick the dual basis as our linear functionals and extend them to the whole space.
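Spelling that out, just to fix notation (this is the standard dual-basis fact): for linearly independent v_i, the dual functionals g_i are defined by

$$
g_i(v_j) = \delta_{ij}, \qquad\text{so}\qquad g_i\Big(\sum_j x_j v_j\Big) = x_i ,
$$

and a rank-one component of the form (output vector) · g_i therefore reads off x_i and nothing else from the span of the v_j.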
If you have 100 orthogonal linear probes to read with, yes. But since there are only 50 neurons, the actual circuits for different input features in the network will have interference to deal with.
Edit: To be clear, of course the network itself cannot perform the task precisely. I’m simply claiming that you can precisely mimic the behaviour of W_in with 100 rank-1 components, by just reading off the basis, as SPD does in this case. The fact that the v_i themselves are not orthogonal is irrelevant.
To be concrete: if we have 100 linearly independent vectors v_i, we can extend them to a basis of the whole 1000-dimensional space. Let V be the change of basis matrix from the standard basis to this basis (so the rows of V are the dual-basis functionals, and the columns of V^{-1} are the v_i). Then we can write W_in = (W_in V^{-1})V, where we are free to choose how to split W_in V^{-1} into rank-one pieces.
If we write W_in V^{-1} as a sum of rank-one matrices W_c, then the W_c V will sum to W_in, and each W_c V is still rank one, since the image of W_c V equals the image of W_c (V being invertible).
So we can assume WLOG that our v_i lie along the standard basis, i.e. that they are orthogonal with respect to the standard inner product.
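As a quick numerical sketch of the construction above (NumPy, with the dimensions from this thread; the random W_in and random feature vectors are stand-ins, not any actual trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n_feats = 1000, 50, 100              # dimensions from the thread

W_in = rng.normal(size=(d_hidden, d_in))             # stand-in for the matrix to decompose
feats = rng.normal(size=(d_in, n_feats))             # 100 linearly independent (non-orthogonal) v_i
B = np.hstack([feats, rng.normal(size=(d_in, d_in - n_feats))])  # extend to a basis of R^1000
V = np.linalg.inv(B)                                 # rows of V = dual functionals g_i

def component(i):
    # rank-one component: (W_in v_i) outer g_i
    return np.outer(W_in @ B[:, i], V[i])

# The 1000 components sum exactly to W_in.
assert np.allclose(sum(component(i) for i in range(d_in)), W_in, atol=1e-6)

# Zero-interference readout: on an input built from the first 100 features,
# component i responds only to coefficient x_i, even though the v_i aren't orthogonal.
x_coeffs = rng.normal(size=n_feats)
x = feats @ x_coeffs
for i in range(5):                                   # spot-check a few components
    assert np.allclose(component(i) @ x, x_coeffs[i] * (W_in @ feats[:, i]))
```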
As a more productive question: say we had an LLM which, amongst other things, does the following. Whenever a known bigram is encoded in the residual stream in the form t_1 A + t_2 B (corresponding to the known bigram t_1 t_2), potentially with interference from other aspects, an MLP layer outputs a consistent vector v_{t_1 t_2} into the residual stream. This is how GPT-2 encodes known bigrams, hence the relevance.
And say that there are quadratically many known bigrams as a function of hidden neuron size, so that in particular there are more bigrams than the residual stream dimension. As far as I know, an appropriately randomly initialized network should be able to accomplish this task (or at least with W_in random).
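To have something concrete to reason about, here is a minimal sketch of that toy setup. Everything in it is a made-up stand-in (the sizes, the token direction matrices A and B, the target vectors v_{t1t2}): freeze a random W_in, fit only the read-out by least squares, and see how well the bigram table is reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
d_resid, d_mlp, n_vocab = 64, 256, 40                        # hypothetical sizes
bigrams = [(t1, t2) for t1 in range(n_vocab) for t2 in range(n_vocab)]  # 1600 bigrams > d_resid, d_mlp

A = rng.normal(size=(n_vocab, d_resid)) / np.sqrt(d_resid)   # "previous token" directions
B = rng.normal(size=(n_vocab, d_resid)) / np.sqrt(d_resid)   # "current token" directions
targets = rng.normal(size=(len(bigrams), d_resid)) / np.sqrt(d_resid)  # desired outputs v_{t1 t2}

W_in = rng.normal(size=(d_mlp, d_resid)) / np.sqrt(d_resid)  # frozen random read-in
X = np.stack([A[t1] + B[t2] for t1, t2 in bigrams])          # residual stream input per bigram
H = np.maximum(X @ W_in.T, 0.0)                              # ReLU hidden activations

# Fit only the read-out; how well this works with W_in random is part of
# the question being asked here.
W_out, *_ = np.linalg.lstsq(H, targets, rcond=None)
err = np.linalg.norm(H @ W_out - targets) / np.linalg.norm(targets)
print(f"relative error of the bigram lookup: {err:.3f}")
```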
Is the goal for SPD to learn components for W_in such that any given component only fires non-negligibly on a single bigram? Or is it ok if components fire on multiple different bigrams? I am trying to reason through how SPD would act in this case.
I’d have to think about the exact setup here to make sure there’s no weird caveats, but my first thought is that for W_in, this ought to be one component per bigram, firing exclusively for that bigram.
Sure, yes, that’s right. But I still wouldn’t take this to be equivalent to our v_i literally being orthogonal, because the trained network itself might not perfectly learn this transformation.
If there is some internal gradient descent reason for it being easier to learn to read off orthogonal vectors, then I take it back. I feel like I am being too pedantic here, in any case.
An intuition pump: Imagine the case of two scalar features c_1, c_2 being embedded along vectors f_1, f_2. If you consider a series that starts with f_1, f_2 being orthogonal, then gives them ever higher cosine similarity, I’d expect the network to have ever more trouble learning to read out c_1, c_2, until we hit f_1 = f_2, at which point the network definitely cannot learn to read the features out at all. I don’t know how the learning difficulty behaves over this series exactly, but it sure seems to me like it ought to go up monotonically at least.
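One way to probe how the difficulty behaves over that series, as a toy experiment (the dimensions, learning rate, and tolerance below are arbitrary choices, not a claim about any particular training setup): count how many gradient steps a linear read-out needs to recover c_1, c_2 at a given cosine similarity.

```python
import numpy as np

def steps_to_learn_readout(cos_sim, d=20, n=1024, lr=0.1, tol=1e-3, max_steps=100_000, seed=0):
    """Full-batch GD steps needed to learn to read (c_1, c_2) back out of
    x = c_1 f_1 + c_2 f_2, for unit features f_1, f_2 at the given cosine similarity."""
    rng = np.random.default_rng(seed)
    f1 = np.zeros(d)
    f1[0] = 1.0
    f2 = np.zeros(d)
    f2[0], f2[1] = cos_sim, np.sqrt(1.0 - cos_sim**2)
    F = np.stack([f1, f2])            # 2 x d embedding of the two features
    C = rng.normal(size=(n, 2))       # true feature values c_1, c_2
    X = C @ F                         # observed inputs
    W = np.zeros((d, 2))              # linear read-out to be learned
    for step in range(max_steps):
        err = X @ W - C
        if np.mean(err**2) < tol:
            return step
        W -= lr * X.T @ err / n       # plain full-batch gradient step
    return max_steps

for rho in [0.0, 0.5, 0.9, 0.99, 0.999]:
    print(rho, steps_to_learn_readout(rho))
```

For plain gradient descent on this quadratic the step count scales roughly like 1/(1 − ρ), since the smallest relevant eigenvalue of the input covariance is 1 − ρ, so at least in this toy version the difficulty does go up monotonically.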
Another intuition pump: The higher the cosine similarity between the features, the larger the norms of the dual readout vectors (the rows of V above) will be, with the norm going to infinity in the limit of cosine similarity going to one.
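To make that quantitative in the two-feature case: for unit-norm f_1, f_2 with cosine similarity ρ, the dual readout vectors g_i (defined by g_i · f_j = δ_ij) satisfy

$$
G = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad
g_i = \sum_j (G^{-1})_{ij}\, f_j, \qquad
\|g_i\|^2 = (G^{-1})_{ii} = \frac{1}{1-\rho^2} \;\xrightarrow{\;\rho \to 1\;}\; \infty .
$$

At ρ ≈ 1/√1000 this factor is about 1.001, consistent with the point below.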
I agree that at cosine similarity O(1/√1000), it’s very unlikely to be a big deal yet.
Yeah that makes sense, thanks.