One subtlety which I’d expect is relevant here: when two singular vectors have approximately the same singular value, the two vectors are very numerically unstable (within their span).
Suppose that two singular vectors have the same singular value. Then in the SVD, we have two terms of the form
$$\begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} s_1 & 0 \\ 0 & s_1 \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix}$$
(where the $u_i$'s and $v_i$'s are column vectors). That middle part is just the shared singular value $s_1$ times a 2x2 identity matrix:
$$= s_1 \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix}$$
But the 2x2 identity matrix can be rewritten as a 2x2 rotation $R$ times its inverse $R^T$:
$$= s_1 \begin{bmatrix} u_1 & u_2 \end{bmatrix} R R^T \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix}$$
… and then we can group $R$ and $R^T$ with $U$ and $V$, respectively, to rotate the singular vectors:
$$= \left(\begin{bmatrix} u_1 & u_2 \end{bmatrix} R\right) \begin{bmatrix} s_1 & 0 \\ 0 & s_1 \end{bmatrix} \left(R^T \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix}\right)$$
Since the columns of $UR$ and $VR$ are still orthonormal, the end result is another valid singular value decomposition of the same matrix.
Upshot: when a singular value is repeated, the corresponding singular vectors are defined only up to a rotation (where the dimension of the rotation is the multiplicity of the repeated singular value).
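The argument above is easy to check numerically. A small NumPy sketch, using an arbitrary made-up 4x4 matrix whose top singular value appears twice: rotating the corresponding pair of left and right singular vectors by the same 2x2 rotation reproduces the original matrix exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: a 4x4 matrix whose top singular value is repeated.
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
S = np.diag([2.0, 2.0, 0.5, 0.1])  # s1 = 2 appears twice
M = U @ S @ V.T

# Rotate the first two left and right singular vectors by the same 2x2 rotation R.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
U2 = U.copy()
V2 = V.copy()
U2[:, :2] = U[:, :2] @ R
V2[:, :2] = V[:, :2] @ R

# Same matrix, different (equally valid) singular vectors.
print(np.allclose(M, U2 @ S @ V2.T))  # True
```

The rotation cancels against its transpose exactly as in the derivation: the middle factor for the repeated block is $s_1 I$, so $R$ commutes through it.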
What this means practically/conceptually is that, if two singular vectors have very close singular values, then a small amount of noise in the matrix will typically “mix them together”. For instance, the post shows a plot of singular values for the OV matrix, and a whole bunch of them are very close together. Conceptually, that means the corresponding singular vectors are probably all “mixed together” to a large extent. Insofar as they all have roughly the same singular value, the singular vectors themselves are underdefined/unstable; what’s fully specified is only the span of the singular vectors sharing that singular value.
(In fact, for the singular value distribution shown for the OV matrix in the post, nearly all the singular values are either approximately 10, or approximately 0. So that particular matrix is approximately a projection matrix, and the span of the singular vectors on either side gives the space projected from/to.)
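The noise-mixing claim can also be sketched numerically. Below, a hypothetical 5x5 matrix has two nearly equal top singular values (3.0 and 2.999), and noise comparable to that 0.001 gap is added. The individual top singular vectors may rotate substantially, but the 2D span they define barely moves (its principal angles with the original span stay tiny, because the gap to the *next* singular value is large).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical matrix: top two singular values nearly equal,
# well separated from the rest.
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
M = U @ np.diag([3.0, 2.999, 1.0, 0.5, 0.1]) @ V.T

# Perturb with noise on the order of the 0.001 gap between the top two values.
Mp = M + 1e-3 * rng.standard_normal((5, 5))
U0, _, _ = np.linalg.svd(M)
Up, _, _ = np.linalg.svd(Mp)

# The individual top singular vectors can rotate within the near-degenerate span...
print(abs(U0[:, 0] @ Up[:, 0]))

# ...but the 2D span itself is stable: the cosines of the principal angles
# between the original and perturbed top-2 spans are all very close to 1.
overlap = np.linalg.svd(U0[:, :2].T @ Up[:, :2], compute_uv=False)
print(overlap.min())
```

This is the Davis–Kahan-style picture: stability of a singular subspace is governed by the gap to singular values *outside* it, not by gaps within it.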
Great point. I agree that the singular vectors become unstable when the singular values are very close (and are meaningless within the span when the values are identical). However, I don’t think this is the main driver of the effect in the post. The graph of the singular values shown is quite misleading about the gaps (this was my bad!). Because the OV matrix is effectively of rank 64, there is a sudden jump down to almost 0 which dominates the log-scale plot. I was originally using that graph to show that effect, but in retrospect it is kind of an obvious one and not super interesting. I’ve replotted the graph with a cut-off at 64, and you can see that the singular values are actually reasonably spaced in log-space, with a roughly exponential decay down to about 0.6. None of them are so close to their neighbours that I’d expect it to cause this instability.
Interestingly, the spectra you get from doing this are very consistent across heads, and you also see them in non-truncated form in the MLP weight matrices, which show a consistent power-law spectrum.