Possibly the source of our disagreement here is that you are imagining the neuron ought to be strictly monotonically increasing in activation relative to the dog-headedness of the image?
If we abandon that assumption then it is relatively clear how to encode two numbers in 1D. Let’s assume we observe two numbers X,Y. With probability p, X=0,Y∼N(0,1), and with probability (1−p), Y=0,X∼N(0,1).
We now want to encode these two events in some third variable Z, such that we can perfectly reconstruct X,Y with probability ≈1.
I put the solution behind a spoiler for anyone wanting to try it on their own.
Choose some very large μ≫1 (much larger than the standard deviation of the feature distributions). For the first event, set Z=Y−μ; for the second event, set Z=X+μ.
The decoding works as follows:
If Z is negative, then with probability ≈1 we are in the first scenario and we can set X=0,Y=Z+μ. Vice versa if Z is positive.
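To make the scheme concrete, here is a minimal sketch of the encode/decode pair. The offset μ=1000 is an arbitrary choice; anything far above the feature scale works:

```python
import random

MU = 1000.0  # offset much larger than the feature scale (std = 1)

def encode(x, y):
    """Map the sparse pair (X, Y) -- exactly one is nonzero -- to a scalar Z."""
    return y - MU if x == 0 else x + MU

def decode(z):
    """Recover (X, Y); correct as long as |X| and |Y| stay far below MU."""
    return (0.0, z + MU) if z < 0 else (z - MU, 0.0)

# Round-trip check (exact up to floating-point error from the large offset):
y = random.gauss(0, 1)
x_hat, y_hat = decode(encode(0.0, y))
assert x_hat == 0.0 and abs(y_hat - y) < 1e-9
```

The decoding only fails on the vanishingly rare draws where |X| or |Y| exceeds μ, which is why reconstruction succeeds with probability ≈1 rather than exactly 1.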
Ah, I see. Thank you for pointing this out. Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.
In any case, for a network like the one you describe I would change my claim from
it’d mean that to the AI, dog heads and car fronts are “the same thing”.
to the AI having a concept for something humans don’t have a neat short description for. So for example, if your algorithm maps X>0, Y>0 to the first case, I’d call it a feature of “presence of dog heads or presence of car fronts”.
I don’t think this is an inherent problem for the theory. That a single floating point number can contain a lot of information is fine, so long as you have some way to measure how much information it actually carries.
Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.
I’m not aware of any work that identifies superposition in exactly this way in NNs of practical use. As Spencer notes, you can verify that it does appear in certain toy settings, though. Anthropic note in their SoLU (Softmax Linear Unit) paper that they view their results as evidence for the superposition hypothesis (SPH) in LLMs. Imo the key piece of evidence is that using a SoLU destroys performance, but adding another LayerNorm afterwards fixes that. The SoLU selects strongly against superposition and the LayerNorm makes it possible again, which is some evidence that the way the LLM got to its performance was via superposition.
ETA: Ofc there could be some other mediating factor, too.
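For reference, SoLU is just x·softmax(x) applied elementwise. A quick sketch shows why it penalises superposition-like activations: a single large entry passes through nearly unchanged, while the same mass spread over several entries gets damped.

```python
import math

def solu(xs):
    """SoLU activation from Anthropic's paper: elementwise x_i * softmax(x)_i."""
    m = max(xs)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [x * e / s for x, e in zip(xs, exps)]

peaked = [5.0, 0.0, 0.0, 0.0]   # one feature firing strongly ("monosemantic")
spread = [2.5, 2.5, 0.0, 0.0]   # the same total mass split over two features

print(solu(peaked)[0])  # ~4.9: barely attenuated
print(solu(spread)[0])  # ~1.16: squashed to less than half
```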
This example is only meant to illustrate how one could achieve such an encoding; it’s not how an actual autoencoder would work. A real NN might not even use superposition for the data I described, and it might take some other setup to elicit this behavior. But it sounded to me like you suspect superposition is nothing but the network being confused, whereas I think it can be the correct way to still reconstruct the features to a reasonable degree.
Not confused, just optimised to handle data of the kind seen in training, and with limited ability to generalise beyond that, compared to human vision.
Yeah I agree with that. But there is also a sense in which some (many?) features will be inherently sparse.
A token is either the first one of multi-token word or it isn’t.
A word is either a noun, a verb or something else.
A word belongs to language LANG and not to any other language/has other meanings in those languages.
An H×W image can only contain so many objects, which can only contain so many sub-aspects.
I don’t know what it would mean to go “out of distribution” in any of these cases.
This means that any network that has an incentive to conserve parameter usage (however we want to define that) might want to use superposition.
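As an illustration of that last point, here is a toy sketch (not a trained network — the feature directions are random rather than learned) of how 64 sparse features can live in 16 dimensions. As long as only one feature is active at a time, its identity and value can be read back out, at the cost of small interference on the inactive features:

```python
import math
import random

random.seed(0)
N_FEATURES, N_DIMS = 64, 16   # many more features than dimensions

def unit_direction():
    """A random unit vector in N_DIMS dimensions."""
    v = [random.gauss(0, 1) for _ in range(N_DIMS)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

dirs = [unit_direction() for _ in range(N_FEATURES)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode(i, v):
    """Represent 'feature i is active with value v' as one N_DIMS vector."""
    return [v * x for x in dirs[i]]

def decode(z):
    """Pick the feature whose direction best matches z, and read off its value."""
    i = max(range(N_FEATURES), key=lambda j: abs(dot(z, dirs[j])))
    return i, dot(z, dirs[i])

i, v = decode(encode(42, 1.7))
print(i, round(v, 3))  # recovers feature 42 with value ~1.7

# The price of superposition: reading out an *inactive* feature yields a small
# nonzero "interference" value instead of exactly 0.
print(round(dot(encode(42, 1.7), dirs[0]), 3))
```

With genuinely sparse features this works even though 64 directions cannot be mutually orthogonal in 16 dimensions; the interference terms only become a problem when many features fire at once, which sparsity rules out by assumption.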