Do wider language models learn more fine-grained features?
The superposition hypothesis suggests that language models learn features as pairwise almost-orthogonal directions in N-dimensional space.
Fact: The number of admissible pairwise almost-orthogonal features in R^N grows exponentially in N (exactly orthogonal directions are capped at N).
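A quick numerical sanity check of this fact (a minimal sketch; the dimensions and vector counts are arbitrary illustrative choices): sample far more random unit vectors than dimensions and confirm that their pairwise cosine similarities stay modest, and fall as N grows.

```python
import numpy as np

# Minimal check that many more than N directions can coexist in R^N with only
# small pairwise interference. N and the vector counts are illustrative.
rng = np.random.default_rng(0)
for n_dims in (64, 128, 256):
    n_vecs = 10 * n_dims                              # far more vectors than dimensions
    v = rng.standard_normal((n_vecs, n_dims))
    v /= np.linalg.norm(v, axis=1, keepdims=True)     # unit-normalise
    cos = v @ v.T
    np.fill_diagonal(cos, 0.0)                        # ignore self-similarity
    print(f"N={n_dims}: {n_vecs} vectors, max |cos| = {np.abs(cos).max():.3f}")
# Even with 10x more vectors than dimensions, the worst-case interference
# falls as N grows; careful (non-random) constructions pack exponentially
# many directions for a fixed interference tolerance.
```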
Corollary: Wider models can learn exponentially more features. What do they use this ‘extra bandwidth’ for?
Hypothesis 1: Wider models learn approximately the same set of features as the narrower models, but also learn many more long-tail features.
Hypothesis 2: Wider models learn a set of features which are on average more fine-grained than the features present in narrower models. (Note: This idea is essentially feature splitting, but as it pertains to the model itself as opposed to a sparse autoencoder trained on the model.)
Concrete experiment idea:
Train two models of differing width but same depth to approximately the same capability level (note this likely entails training the narrower model for longer).
Try to determine ‘how many features’ are in each, e.g. by training an SAE of fixed width and counting the number of interpretable latents; a rough sketch of this is given after this list. (Possibly there exists some simpler way to do this; haven’t thought about it long.)
Try to match the features in the smaller model to the features in the larger model.
Try to compare the ‘coarseness’ of the features. Probably this is best done by looking at feature dashboards, although it’s possible some auto-interp approach will work.
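One way to operationalise the feature-counting step (a rough sketch only: it assumes you have already collected a (num_tokens, d_model) tensor of residual-stream activations from each model, and the vanilla L1-penalised ReLU SAE plus all hyperparameters below are placeholder choices, not a recommendation): train the same fixed-width SAE on each model's activations and compare how many latents are alive, as a crude lower bound before any manual or auto-interp review.

```python
import torch

# Rough sketch: train a fixed-width SAE on residual-stream activations from
# each model, then count 'alive' latents (those that fire above a rate
# threshold) as a crude proxy for feature count. `activations` is assumed to
# be a (num_tokens, d_model) tensor you have already collected.

def train_sae(activations, n_latents=16_384, l1_coeff=3e-4,
              steps=20_000, batch=4096, lr=1e-4):
    d_model = activations.shape[1]
    enc = torch.nn.Linear(d_model, n_latents)
    dec = torch.nn.Linear(n_latents, d_model)
    opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, activations.shape[0], (batch,))
        x = activations[idx]
        z = torch.relu(enc(x))                     # sparse latent code
        x_hat = dec(z)
        loss = ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return enc, dec

def count_alive_latents(enc, activations, firing_threshold=1e-5, min_rate=1e-4):
    # A latent counts as 'alive' if it fires on more than min_rate of tokens;
    # counting *interpretable* latents would additionally require reviewing
    # the alive set. Batch this step for large activation sets.
    with torch.no_grad():
        z = torch.relu(enc(activations))
    firing_rate = (z > firing_threshold).float().mean(dim=0)
    return int((firing_rate > min_rate).sum())
```

Keeping n_latents and the training budget fixed across the two models is what makes the alive-latent counts comparable.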
Review: This experiment is probably too complicated to make work within a short timeframe. Some conceptual reframing is needed to make it more tractable.
Prediction: the SAE results may come out better for the wider model, but only if you control for something else (perplexity, say) or regularize more heavily. The literature on wide vs. deep NNs has historically reported a stylized fact that wider NNs tend to ‘memorize more, generalize less’ (which you can interpret as the fine-grained features being used mostly to memorize individual datapoints, perhaps exploiting dataset biases or non-robust features), so deeper NNs do better (if you can optimize them effectively without exploding/vanishing gradients); this could more than offset any orthogonality gains from the greater width. So you would either need to regularize more heavily (‘over-regularizing’ from the POV of the wide net, which would achieve better performance if it could memorize more of the long tail, the way it ‘wants’ to) or otherwise adjust for performance (to disentangle the performance benefits of width from the distorting effect of achieving that performance via more memorization).
Intuition pump: When you double the size of the residual stream, the number of distinct (almost-orthogonal) features a language model can learn roughly squares: if roughly exp(cN) such directions fit in N dimensions, then exp(2cN) = (exp(cN))^2 fit in 2N. If we call a model X and its twice-as-wide version Y, we might expect that the features in Y all look like Cartesian pairs of features in X.
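A toy construction to make the Cartesian-pair picture concrete (purely illustrative; the counts and dimensions are made up, and it is only a loose picture, since pairs that share a component are far from orthogonal to each other): concatenating every ordered pair of X's F feature directions gives F^2 directions in the doubled residual width.

```python
import numpy as np

# Illustrative sketch of the 'Cartesian pair' picture (F and d are arbitrary):
# model X has F near-orthogonal feature directions in d dims; concatenating
# every ordered pair (f_i, f_j) gives F**2 directions in the twice-as-wide
# (2*d) stream, matching the squaring of the feature count.
rng = np.random.default_rng(0)
d, F = 128, 50

feats_x = rng.standard_normal((F, d))
feats_x /= np.linalg.norm(feats_x, axis=1, keepdims=True)

pairs = np.stack([np.concatenate([fi, fj]) for fi in feats_x for fj in feats_x])
pairs /= np.linalg.norm(pairs, axis=1, keepdims=True)

cos_x = feats_x @ feats_x.T
np.fill_diagonal(cos_x, 0.0)
cos = pairs @ pairs.T
np.fill_diagonal(cos, 0.0)

# Track which X-features each pair is built from, to split the comparison.
i_of, j_of = np.divmod(np.arange(F * F), F)
shares = (i_of[:, None] == i_of[None, :]) | (j_of[:, None] == j_of[None, :])
np.fill_diagonal(shares, False)

print("X features:", feats_x.shape, "-> pair features:", pairs.shape)
print("max |cos| among X features:         ", round(float(np.abs(cos_x).max()), 3))
print("max |cos|, pairs sharing no f_i:    ", round(float(np.abs(cos[~shares]).max()), 3))
print("typical |cos|, pairs sharing an f_i:", round(float(np.abs(cos[shares]).mean()), 3))
# Caveat: pairs that share a component sit at |cos| around 0.5, so this is
# only a loose picture of 'exponentially many almost-orthogonal features'.
```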
Re-stating an important point: this suggests that feature splitting could be something inherent to language models as opposed to an artefact of SAE training.
I suspect this could be elucidated pretty cleanly in a toy model. Need to think more about the specific toy setting that will be most appropriate. Potentially just re-using the setup from Toy Models of Superposition is already good enough.
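A minimal starting point, assuming we do just re-use the Toy Models of Superposition setup (sparse synthetic features with decaying importance, a linear down-projection, and a ReLU readout): train the same data model at a narrow and a twice-as-wide hidden dimension and compare how the feature directions pack in. All hyperparameters below are illustrative placeholders.

```python
import torch

# Minimal sketch re-using the Toy Models of Superposition setup, comparing a
# narrow vs. a twice-as-wide hidden ('residual') dimension on identical
# synthetic feature data. All hyperparameters are illustrative placeholders.
N_FEATURES, SPARSITY, STEPS, BATCH = 80, 0.05, 5_000, 1024
importance = 0.9 ** torch.arange(N_FEATURES).float()   # geometrically decaying importance

def sample_batch(batch_size):
    # Each feature is active independently with prob SPARSITY, value ~ U[0, 1].
    x = torch.rand(batch_size, N_FEATURES)
    mask = torch.rand(batch_size, N_FEATURES) < SPARSITY
    return x * mask

def train_toy_model(hidden_dim, seed=0):
    torch.manual_seed(seed)
    W = torch.nn.Parameter(torch.randn(N_FEATURES, hidden_dim) * 0.1)
    b = torch.nn.Parameter(torch.zeros(N_FEATURES))
    opt = torch.optim.Adam([W, b], lr=1e-3)
    for _ in range(STEPS):
        x = sample_batch(BATCH)
        x_hat = torch.relu(x @ W @ W.T + b)             # project to hidden_dim, then reconstruct
        loss = (importance * (x - x_hat) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return W.detach(), loss.item()

# Train the narrow and the twice-as-wide version on the same feature
# distribution, then compare reconstruction loss and how strongly feature
# directions interfere (off-diagonal entries of W W^T).
for hidden_dim in (8, 16):
    W, loss = train_toy_model(hidden_dim)
    interference = (W @ W.T).fill_diagonal_(0).abs().max().item()
    print(f"hidden={hidden_dim:2d}: loss={loss:.5f}, max off-diag |W W^T|={interference:.3f}")
```

The interesting comparison would be whether the wider toy model spends its extra capacity on more of the low-importance (long-tail) features or on reducing interference among the features it already represents, mirroring Hypotheses 1 and 2 above.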
Related idea: If we can show that models themselves learn features of different granularity, we could then test whether SAEs reflect this difference. (I expect they do not.) If they don’t, this would imply that SAEs capture properties of the data rather than of the model.