Comments on the outcomes of the post:
I’m reasonably happy with how this post turned out. I think it probably bought the Anthropic/superposition mechanistic interpretability agenda somewhere between 0.1 and 4 counterfactual months of progress, which feels like a win.
I think sparse autoencoders are likely to be a pretty central method in mechanistic interpretability work for the foreseeable future (which tbf is not very foreseeable).
Two parallel works used the method identified in the post (sparse autoencoders, or SAEs) or a slight modification of it:
[Cunningham et al. (2023)](https://arxiv.org/abs/2309.08600), a project which I supervised.
[Bricken et al. (2023)](https://transformer-circuits.pub/2023/monosemantic-features), the Anthropic paper ‘Towards Monosemanticity’.
I think the fact that two teams were able to use the results to explore complementary directions in parallel partly validates Conjecture’s policy (at that time) of publishing quick, scrappy results that optimize for impact rather than rigour. I make this note because that policy attracted some criticism that I perceived to be undue, and to highlight that some of the policy’s benefits only become visible over longer periods.
Some regrets related to the post:
It was pretty silly of me to divide the L1 loss by the number of dictionary elements. The idea was that the L1 loss per dictionary element would stay roughly constant as the dictionary was scaled up. But that isn’t what you want: assuming the number of data-generating features is fixed, you want the sparsity penalty to grow as you scale. This made it more difficult than it needed to be to find the right hyperparameters (there is a small sketch of the two penalty variants after this list). Fortunately, Logan Smith (iirc) identified this issue while working on Cunningham et al.
The language model results were underwhelming. I strongly suspect they were undertrained. This was addressed in a follow-up post (https://www.alignmentforum.org/posts/DezghAd4bdxivEknM/a-small-update-to-the-sparse-coding-interim-research-report).
I regret giving a specific number of potential features: “Here we found very weak, tentative evidence that, for a model of size d_model = 256, the number of features in superposition was over 100,000. This is a large scaling factor and it’s only a lower bound. If the estimated scaling factor is approximately correct (and, we emphasize, we’re not at all confident in that result yet) or if it gets larger, then this method of feature extraction is going to be very costly to scale to the largest models – possibly more costly than training the models themselves.” Despite all the qualifications and expressions of deep uncertainty, I got the impression that many people read too much into this. I think not publishing the LM results, or at least not giving a specific figure, would have avoided this misunderstanding.
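For concreteness, here is a rough sketch of the kind of loss in question (illustrative code with made-up names, not the training code from the post); the divide_by_dict_size flag reproduces the normalisation regretted above:

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Minimal sparse autoencoder sketch: an overcomplete dictionary with a
    ReLU encoder and a linear decoder. Names and details are illustrative."""
    def __init__(self, d_model: int, n_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_dict)
        self.decoder = nn.Linear(n_dict, d_model, bias=False)

    def forward(self, x):
        feature_acts = torch.relu(self.encoder(x))  # sparse feature coefficients
        x_hat = self.decoder(feature_acts)          # reconstructed activation
        return x_hat, feature_acts

def sae_loss(x, x_hat, feature_acts, l1_coeff, divide_by_dict_size=False):
    """Reconstruction MSE plus an L1 sparsity penalty on the feature coefficients.
    divide_by_dict_size=True reproduces the regretted choice: it keeps the penalty
    per dictionary element roughly constant, so the overall sparsity pressure
    fails to grow as the dictionary is scaled up."""
    recon_loss = ((x - x_hat) ** 2).mean()
    l1 = feature_acts.abs().sum(dim=-1).mean()
    if divide_by_dict_size:
        l1 = l1 / feature_acts.shape[-1]
    return recon_loss + l1_coeff * l1
```

With the flag off, adding dictionary elements increases the total L1 penalty, which is the behaviour you actually want when the number of underlying features is fixed.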
Outlying issues:
In their current formulation, SAEs leave a few important problems unaddressed, including:
SAEs probably don’t learn the most functionally relevant features. They find directions in the activations that are separable, but that doesn’t necessarily reflect the network’s ontology. The features learned by SAEs are probably too granular.
SAEs don’t automatically provide a way to summarize the interactions between features (i.e. there is a gap between features and circuits).
The SAEs used in the above-mentioned papers aren’t a very satisfying solution for dealing with attention-head polysemanticity.
SAEs optimize two losses: a reconstruction loss and an L1 penalty on the feature coefficients. I think this penalty means that, in expectation, the learned coefficients will systematically undershoot their correct values (this has been observed empirically in private correspondence; see the illustration after this list).
My collaborators and I are working on each of these problems.
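To illustrate the undershooting point in the simplest possible setting (an illustrative calculation, not from the post): for a single active feature with true coefficient a and L1 weight lam, minimising the penalised squared error gives a coefficient that is always lam/2 too small, or zero:

```python
def l1_penalised_coeff(a: float, lam: float) -> float:
    """Minimiser of (x - a)**2 + lam * abs(x) over x >= 0:
    the soft-thresholded value a - lam/2, clipped at zero."""
    return max(a - lam / 2.0, 0.0)

for a in [0.5, 1.0, 2.0]:
    # Each recovered coefficient undershoots the true value a by lam/2 (or collapses to 0).
    print(f"true coefficient {a:.1f} -> recovered {l1_penalised_coeff(a, lam=1.0):.1f}")
```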
I really appreciated this retrospective; it changed my mind about the sparsity penalty, thanks!
Great! I’m curious, what was it about the sparsity penalty that you changed your mind about?
I previously thought that L1 penalties were just exactly what you wanted in order to do sparse reconstruction.
Thinking about your undershooting claim, I came up with a toy example that made it obvious to me that the Anthropic loss function was not optimal: suppose you are role-playing a single-feature SAE reconstructing the number 2, and are given loss equal to the squared error of your guess plus the norm of your guess. Then, for guesses x > 0, the loss is minimized at x = 3/2, not 2.
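Spelling out the arithmetic for that example (the “norm of your guess” term amounts to an L1 weight of 1):

$$L(x) = (x - 2)^2 + |x|, \qquad \left.\frac{dL}{dx}\right|_{x > 0} = 2(x - 2) + 1 = 0 \;\Rightarrow\; x = \tfrac{3}{2}.$$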
Makes sense! Thanks!