Comments on the outcomes of the post:
I’m reasonably happy with how this post turned out. I think it probably bought the Anthropic/superposition mechanistic interpretability agenda somewhere between 0.1 and 4 counterfactual months of progress, which feels like a win.
I think sparse autoencoders are likely to be a pretty central method in mechanistic interpretability work for the foreseeable future (which tbf is not very foreseeable).
Two parallel works used the method identified in the post (sparse autoencoders, or SAEs) or a slight modification of it:
[Cunningham et al. (2023)](https://arxiv.org/abs/2309.08600), a project which I supervised.
[Bricken et al. (2023)](https://transformer-circuits.pub/2023/monosemantic-features), the Anthropic paper ‘Towards Monosemanticity’.
I think the fact that two teams were able to use the results to explore complementary directions in parallel partly validates Conjecture’s policy (at that time) of publishing quick, scrappy results that optimize for impact rather than rigour. I make this note because that policy attracted some criticism that I perceived to be undue, and to highlight that some of the policy’s benefits only become visible over longer periods.
Some regrets related to the post:
It was pretty silly of me to divide the L1 loss by the number of dictionary elements. The idea was that the L1 loss per dictionary element would stay roughly constant as the dictionary was scaled up. But that isn’t what you want: assuming the number of data-generating features is fixed, you want the sparsity penalty to grow as you scale. This made it more difficult than it needed to be to find the right hyperparameters (there is a small sketch of the two penalty variants after this list). Fortunately, Logan Smith (iirc) identified this issue while working on Cunningham et al.
The language model results were underwhelming. I strongly suspect they were undertrained. This was addressed in a follow-up post (https://www.alignmentforum.org/posts/DezghAd4bdxivEknM/a-small-update-to-the-sparse-coding-interim-research-report).
I regret giving a specific number of potential features: “Here we found very weak, tentative evidence that, for a model of size d_model = 256, the number of features in superposition was over 100,000. This is a large scaling factor and it’s only a lower bound. If the estimated scaling factor is approximately correct (and, we emphasize, we’re not at all confident in that result yet) or if it gets larger, then this method of feature extraction is going to be very costly to scale to the largest models – possibly more costly than training the models themselves.” Despite all the qualifications and expressions of deep uncertainty, I got the impression that many people read too much into this. I think not publishing the LM results, or at least not giving a specific figure, would have avoided this misunderstanding.
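For concreteness, here is a rough sketch of the kind of loss in question (illustrative code with made-up names, not the training code from the post); the divide_by_dict_size flag reproduces the normalisation regretted above:

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Minimal sparse autoencoder sketch: an overcomplete dictionary with a
    ReLU encoder and a linear decoder. Names and details are illustrative."""
    def __init__(self, d_model: int, n_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_dict)
        self.decoder = nn.Linear(n_dict, d_model, bias=False)

    def forward(self, x):
        feature_acts = torch.relu(self.encoder(x))  # sparse feature coefficients
        x_hat = self.decoder(feature_acts)          # reconstructed activation
        return x_hat, feature_acts

def sae_loss(x, x_hat, feature_acts, l1_coeff, divide_by_dict_size=False):
    """Reconstruction MSE plus an L1 sparsity penalty on the feature coefficients.
    divide_by_dict_size=True reproduces the regretted choice: it keeps the penalty
    per dictionary element roughly constant, so the overall sparsity pressure
    fails to grow as the dictionary is scaled up."""
    recon_loss = ((x - x_hat) ** 2).mean()
    l1 = feature_acts.abs().sum(dim=-1).mean()
    if divide_by_dict_size:
        l1 = l1 / feature_acts.shape[-1]
    return recon_loss + l1_coeff * l1
```

With the flag off, adding dictionary elements increases the total L1 penalty, which is the behaviour you actually want when the number of underlying features is fixed.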
Outlying issues:
In their current formulation, SAEs leave a few important problems unaddressed, including:
SAEs probably don’t learn the most functionally relevant features. They find directions in the activations that are separable, but that doesn’t necessarily reflect the network’s ontology. The features learned by SAEs are probably too granular.
SAEs don’t automatically provide a way to summarize the interactions between features (i.e. there is a gap between features and circuits).
The SAEs used in the above-mentioned papers aren’t a very satisfying solution for dealing with attention-head polysemanticity.
SAEs optimize two losses: a reconstruction loss and an L1 penalty on the feature coefficients. I think this penalty means that, in expectation, the learned coefficients will systematically undershoot their correct values (this has been observed empirically in private correspondence; see the illustration after this list).
My collaborators and I are working on each of these problems.
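To illustrate the undershooting point in the simplest possible setting (an illustrative calculation, not from the post): for a single active feature with true coefficient a and L1 weight lam, minimising the penalised squared error gives a coefficient that is always lam/2 too small, or zero:

```python
def l1_penalised_coeff(a: float, lam: float) -> float:
    """Minimiser of (x - a)**2 + lam * abs(x) over x >= 0:
    the soft-thresholded value a - lam/2, clipped at zero."""
    return max(a - lam / 2.0, 0.0)

for a in [0.5, 1.0, 2.0]:
    # Each recovered coefficient undershoots the true value a by lam/2 (or collapses to 0).
    print(f"true coefficient {a:.1f} -> recovered {l1_penalised_coeff(a, lam=1.0):.1f}")
```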
I really appreciated this retrospective; it changed my mind about the sparsity penalty, thanks!
Great! I’m curious, what was it about the sparsity penalty that you changed your mind about?
I previously thought that L1 penalties were just exactly what you wanted in order to do sparse reconstruction.
Thinking about your undershooting claim, I came up with a toy example that made it obvious to me that the Anthropic loss function was not optimal: suppose you are role-playing a single-feature SAE reconstructing the number 2, and are given loss equal to the squared error of your guess plus the norm of your guess. Then, for guesses x > 0, the loss is minimized at x = 3/2, not 2.
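Spelling out the arithmetic for that example (the “norm of your guess” term amounts to an L1 weight of 1):

$$L(x) = (x - 2)^2 + |x|, \qquad \left.\frac{dL}{dx}\right|_{x > 0} = 2(x - 2) + 1 = 0 \;\Rightarrow\; x = \tfrac{3}{2}.$$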
Makes sense! Thanks!