We could allocate more of the total storage to on-indicator information and less to the rotated vector. Such a shift may be closer to optimal.
This is all I mean. Having the small circuits do error correction on on-indicator neurons is just a way of increasing the total percentage of storage allocated to on-indicators in the larger network, in an embedding-agnostic manner. You can change your method for constructing the larger network later and this strategy would be orthogonal to such a change.
I think the case for allocating a higher percentage to on-indicators should still apply even when adding cross-circuit computation.
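To make concrete what I mean by a small circuit doing error correction on on-indicator neurons, here is a minimal sketch, assuming the indicators are roughly binary and the noise comes from superposition interference; the thresholds, noise level, and function names are made up for illustration:

```python
import numpy as np

# Hypothetical illustration: "on-indicators" are roughly binary signals stored
# alongside a rotated payload vector. Reading them out of superposition adds
# interference noise; a small error-correcting circuit can snap the noisy
# read-outs back toward {0, 1} before they gate downstream circuits.

def noisy_readout(indicators, noise_std=0.15, rng=None):
    """Simulate reading on-indicators out of superposition (adds interference noise)."""
    rng = np.random.default_rng() if rng is None else rng
    return indicators + rng.normal(0.0, noise_std, size=indicators.shape)

def error_correct(noisy, lo=0.3, hi=0.7):
    """A crude 'small circuit': a saturating nonlinearity that pushes values
    below `lo` to 0 and above `hi` to 1 (implementable as a difference of two ReLUs)."""
    return np.clip((noisy - lo) / (hi - lo), 0.0, 1.0)

rng = np.random.default_rng(0)
indicators = rng.integers(0, 2, size=16).astype(float)  # which circuits are "on"
noisy = noisy_readout(indicators, rng=rng)
corrected = error_correct(noisy)

print("mean abs error before correction:", np.abs(noisy - indicators).mean())
print("mean abs error after  correction:", np.abs(corrected - indicators).mean())
```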
Possibly there are some things you can eliminate right away? But I think often not. In the transformer architecture, at the start the network just has the embedding vector for the first token and the positional embedding. After the first attention layer, the network has a bit more information, but not that much; the softmax will make sure the network just focuses on a few previous words (right?). And every step of computation (including attention) will come with some noise, if superposition is involved.
The assumption “softmax will make sure the network just focuses on a few previous words (right?)” is true for many attention heads, but not all of them. Some attention heads attend broadly across the whole sequence, aggregating many different tokens together to get the ‘gist’ of a text.
By the end of the first layer of GPT-2 Small, it has constructed a continuous linear summary of the previous 50 tokens, and so has a Word2Vec-style vector in its residual stream. So it knows that the text is about World War I, or about AI, that it is written in British English, whether it is formal or informal, etc., all by the end of the first attention layer (before the first MLP). This is lots of information to go on in terms of turning off circuits etc. (think Mixture-Of-Experts).
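A rough sketch of that picture, assuming a broadly-attending head can be approximated as a near-uniform average over the previous ~50 token embeddings, with the resulting gist feeding an MoE-style router; the window size, expert count, and gating scheme are illustrative assumptions, not GPT-2’s actual weights:

```python
import numpy as np

# Illustration: a broadly-attending head approximated as a near-uniform average
# over the previous ~50 token embeddings, yielding a Word2Vec-style "gist"
# vector in the residual stream. The gist is then used to decide which circuits
# to switch on, mixture-of-experts style. All numbers here are made up except
# the residual width of GPT-2 Small.

D_MODEL = 768        # residual stream width (GPT-2 Small)
WINDOW = 50          # how far back the broad head effectively looks
N_EXPERTS = 8        # hypothetical number of gateable circuits

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(200, D_MODEL))   # stand-in for embedded tokens

def gist_vector(embeddings, pos, window=WINDOW):
    """Near-uniform attention over the previous `window` tokens."""
    start = max(0, pos - window + 1)
    return embeddings[start:pos + 1].mean(axis=0)

def gate_circuits(gist, readout):
    """Pick which circuits to run from the gist (a crude MoE-style router)."""
    scores = readout @ gist                # (N_EXPERTS,) routing scores
    return np.argsort(scores)[-2:]         # keep the top-2 circuits

readout = rng.normal(size=(N_EXPERTS, D_MODEL))  # hypothetical router weights
g = gist_vector(token_embeddings, pos=120)
print("circuits switched on:", gate_circuits(g, readout))
```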
There is a regime where the (updated) framework works: see figures 8–11 for values T(d/D)^2 < 0.004. However, for network sizes I can run on my laptop, that does not leave room for very much superposition.
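A quick sketch of how that condition could be checked numerically; my reading of T, d and D as number of circuits, circuit width and network width is a guess, not a definition from the post:

```python
# Check of the regime condition from the post, T * (d / D)**2 < 0.004.
# The interpretations below (T = number of circuits, d = circuit width,
# D = network width) are my assumptions, not definitions from the post.

def in_working_regime(T, d, D, threshold=0.004):
    """Return True if the condition T*(d/D)^2 < threshold holds."""
    return T * (d / D) ** 2 < threshold

# Example: a laptop-sized setting with modest superposition.
print(in_working_regime(T=100, d=10, D=2000))   # 100 * 0.005**2 = 0.0025 -> True
print(in_working_regime(T=100, d=20, D=2000))   # 100 * 0.01**2  = 0.01   -> False
```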
Nice, ok. So asymptotically it works fine then? So the next step, theory-wise, is a framework that allows for cross-circuit computation, I guess.
Do you want to have a call sometime in January? There are probably lots of things that aren’t explained as well as they could have been in the post.
I would be grateful if you can find the time. I have recently had some more ideas about this stuff that are potentially more useful and might be worth discussing.