Quick Thoughts on Scaling Monosemanticity


1. How Many Features are Active at Once?

Previously I’ve seen the rule of thumb “20-100 for most models”. Anthropic says:

For all three SAEs, the average number of features active (i.e. with nonzero activations) on a given token was fewer than 300
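
For concreteness, the number being reported is just the per-token L0 of the SAE activations. A minimal sketch of that count (the tensor names and shapes here are mine, not Anthropic’s):

```python
import torch

def active_features_per_token(sae_acts: torch.Tensor) -> torch.Tensor:
    """Count nonzero SAE feature activations for each token.

    sae_acts: [n_tokens, n_features] tensor of post-ReLU feature activations.
    Returns a [n_tokens] tensor of L0 counts.
    """
    return (sae_acts > 0).sum(dim=-1)

# Toy example just to show the shape of the computation.
acts = torch.relu(torch.randn(8, 4096) - 2.5)  # hypothetical SAE activations
print(active_features_per_token(acts).float().mean())  # avg active features per token
```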

2. Splitting SAEs

Having multiple different-sized SAEs for the same model seems useful. The dashboard shows feature splitting clearly. I hadn’t ever thought of comparing features from different SAEs using cosine similarity and plotting them together with UMAP.
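
Here’s roughly what I picture that comparison looking like, as a sketch under my own assumptions (the decoder matrices, shapes, and UMAP settings below are placeholders, not the paper’s actual pipeline):

```python
import numpy as np
import umap  # umap-learn

def joint_umap(W_dec_small: np.ndarray, W_dec_large: np.ndarray):
    """Embed decoder directions from two SAEs of different sizes in one 2-D map.

    W_dec_small: [n_feats_small, d_model] decoder rows of the smaller SAE.
    W_dec_large: [n_feats_large, d_model] decoder rows of the larger SAE.
    Using cosine distance, features with similar directions land near each
    other, so split features in the larger SAE cluster around their parent.
    """
    W = np.concatenate([W_dec_small, W_dec_large], axis=0)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize rows
    which_sae = np.array([0] * len(W_dec_small) + [1] * len(W_dec_large))
    embedding = umap.UMAP(metric="cosine", n_neighbors=15).fit_transform(W)
    return embedding, which_sae

# Usage (hypothetical matrices): emb, labels = joint_umap(W_1m, W_34m),
# then scatter-plot emb colored by labels to see feature splitting.
```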

3. Leaky Features

Neither of these plots seems great. They both suggest to me that these SAEs are “leaky” in some sense at lower activation levels, but in opposite ways:

  1. Activating on irrelevant data

  2. Activating unexpectedly weakly on relevant data

For reference, here are the meanings of the specificity scores:

  • 0 – The feature is completely irrelevant throughout the context (relative to the base distribution of the internet).

  • 1 – The feature is related to the context, but not near the highlighted text or only vaguely related.

  • 2 – The feature is only loosely related to the highlighted text or related to the context near the highlighted text.

  • 3 – The feature cleanly identifies the activating text.

Note the low bar for a score of 1, and compare how much area the 0- and 1-scored activations take up. It looks to me like we can really only trust features above a rough 0.3-0.4 activation level. But note what a small fraction of the total activations have that strength!
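
Concretely, the kind of filter I have in mind would keep only activations above some fraction of each feature’s maximum observed activation. The 0.3 cutoff and the names below are my own guesses, not anything from the paper:

```python
import torch

def trusted_activations(sae_acts: torch.Tensor, frac: float = 0.3) -> torch.Tensor:
    """Zero out activations below `frac` of each feature's maximum activation.

    sae_acts: [n_tokens, n_features] SAE activations.
    Returns the same shape with the weak (less specific) activations masked out.
    """
    max_per_feature = sae_acts.max(dim=0, keepdim=True).values  # [1, n_features]
    return sae_acts * (sae_acts >= frac * max_per_feature)
```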

As in Towards Monosemanticity, we see that these features become less specific as the activation strength weakens. This could be due to the model using activation strengths to represent confidence in a concept being present. Or it may be that the feature activates most strongly for central examples of the feature, but weakly for related ideas – for example, the Golden Gate Bridge feature 34M/31164353 appears to weakly activate for other San Francisco landmarks. It could also reflect imperfection in our dictionary learning procedure. For example, it may be that the architecture of the autoencoder is not able to extract and discriminate among features as cleanly as we might want. And of course interference from features that are not exactly orthogonal could also be a culprit, making it more difficult for Sonnet itself to activate features on precisely the right examples. It is also plausible that our feature interpretations slightly misrepresent the feature’s actual function, and that this inaccuracy manifests more clearly at lower activations.

4. On Scaling

We think it’s quite likely that we’re orders of magnitude short, and that if we wanted to get all the features – in all layers! – we would need to use much more compute than the total compute needed to train the underlying models.

They don’t give the exact model size (either the depth or d_model), but as a very rough estimate, suppose their model has depth 100. This paper is about a single layer (somewhere in the middle of the model). Imagine doing all of this work 100 times: first the cost of training multiple SAEs, and then the cost of analyzing them. The analysis can probably be mostly automated, but that’s still going to be expensive (and take time).

I’m interested in ideas for training SAEs for all layers simultaneously, but if you imagine SAEs expanding the model by 32x (for example), then this would naively take 32x the compute of training the original model, or at least 32x the memory if not 32x the FLOPs. (This analysis is naive because they’re expanding the residual stream, not the actual MLP / attention parameters, but it should be directionally correct.)
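
As a back-of-the-envelope version of that claim, here’s the arithmetic under my own assumed numbers (the d_model, depth, and parameter-count approximations are all guesses):

```python
# Rough per-token cost of a 32x-expansion residual-stream SAE vs one transformer layer.
d_model   = 8192   # assumed residual-stream width (not given in the paper)
expansion = 32     # assumed SAE expansion factor
depth     = 100    # the rough depth guess used above

layer_params = 12 * d_model**2             # ~4 d_model^2 (attention) + ~8 d_model^2 (MLP)
sae_params   = 2 * expansion * d_model**2  # encoder + decoder of one SAE

# Forward FLOPs and parameter memory both scale with parameter count, so one
# SAE per layer multiplies the per-layer cost by roughly this factor:
print(f"~{sae_params / layer_params:.1f}x per layer, across all {depth} layers")
```

On parameter counts alone this comes out closer to ~5x per layer than 32x (a transformer layer has roughly 12·d_model^2 parameters, while the SAE’s encoder and decoder together have 2·32·d_model^2), but the upshot is the same: covering every layer is very expensive.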

All of this work is going to look very similar across different layers, with subtle shifts in meaning (in the same way that the Logit Lens treats every layer’s residual stream as if it already meant the same thing as the final layer’s, while the Tuned Lens corrects for the drift).
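
For readers who haven’t seen the two lenses, the distinction is roughly the following (a sketch with placeholder module names; see the Tuned Lens paper for the real parametrization):

```python
import torch
import torch.nn as nn

def logit_lens(resid: torch.Tensor, ln_f: nn.Module, W_U: torch.Tensor) -> torch.Tensor:
    """Logit Lens: unembed an intermediate residual stream directly, as if
    every layer already spoke the final layer's 'language'."""
    return ln_f(resid) @ W_U  # [n_tokens, d_vocab]

class TunedLensTranslator(nn.Module):
    """Tuned Lens: a learned affine map per layer that translates that layer's
    residual stream toward the final layer's basis before unembedding,
    correcting for the layer-to-layer drift the Logit Lens ignores."""
    def __init__(self, d_model: int):
        super().__init__()
        self.affine = nn.Linear(d_model, d_model)

    def forward(self, resid: torch.Tensor, ln_f: nn.Module, W_U: torch.Tensor) -> torch.Tensor:
        return ln_f(resid + self.affine(resid)) @ W_U
```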

5. Good News / Bad News

I didn’t notice many innovations here—it was mostly scaling pre-existing techniques to a larger model than I had seen previously. The good news is that this worked well. The bad news is that none of the old challenges have gone away.

6. Features Still Seem Crude and Hard to Steer With

We’d really like to understand the model’s model of the world. For example, when working with my coworkers I have a good idea of what they do and don’t know, including their general background, their strengths and weaknesses, what we’ve worked on together, and their current state of mind. I’d expect language models to model their interlocutor in a similar way, but the best we can currently say is “300 features, including Openness and Honesty, etc., are active.”

7. Predicting the Presence of a Feature

There’s a very nice fit on this curve.
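
As I understand it, the curve relates how often a concept appears in the training data to whether the SAE learned a dedicated feature for it. If I wanted to reproduce that kind of fit on my own probing data, it would look something like the sketch below; the placeholder data, functional form, and threshold interpretation are mine, not the paper’s numbers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: for each probed concept, an estimated log10 frequency in the
# training corpus and whether a matching SAE feature was found (1) or not (0).
log_freq = np.array([-7.5, -7.0, -6.5, -6.0, -5.5, -5.0, -4.5, -4.0]).reshape(-1, 1)
feature_present = np.array([0, 0, 0, 1, 0, 1, 1, 1])

fit = LogisticRegression().fit(log_freq, feature_present)
# Log-frequency at which a dedicated feature becomes more likely than not:
threshold = -fit.intercept_[0] / fit.coef_[0, 0]
print(f"50% threshold at log10 frequency ~ {threshold:.2f}")
```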

8. Missing Features

For instance, we confirmed that Claude 3 Sonnet can list all of the London boroughs when asked, and in fact can name tens of individual streets in many of the areas. However, we could only find features corresponding to about 60% of the boroughs in the 34M SAE.

You could potentially solve this by scaling SAEs way up, but that just makes the compute challenges even worse. I haven’t seen any research on this, but you could imagine training the SAE so that specific features you want are guaranteed to appear.
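
To gesture at what I mean, here’s an entirely hypothetical sketch: a standard SAE loss (reconstruction plus L1 sparsity) with an extra term that pulls a chosen decoder row toward a direction you want represented, e.g. a probe direction for one of the missing boroughs. Nothing like this appears in the paper:

```python
import torch
import torch.nn.functional as F

def sae_loss_with_seeded_feature(x, x_hat, acts, W_dec, target_dir, seeded_idx,
                                 l1_coeff=5.0, seed_coeff=1.0):
    """Standard SAE loss plus a hypothetical auxiliary term encouraging one
    decoder row to align with a direction we want a feature for.

    x, x_hat:   [batch, d_model] inputs and reconstructions.
    acts:       [batch, n_features] feature activations.
    W_dec:      [n_features, d_model] decoder matrix.
    target_dir: [d_model] direction we want feature `seeded_idx` to represent.
    """
    recon    = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = acts.abs().sum(dim=-1).mean()
    seed_align = 1.0 - (F.normalize(W_dec[seeded_idx], dim=-1)
                        * F.normalize(target_dir, dim=-1)).sum()  # cosine-distance penalty
    return recon + l1_coeff * sparsity + seed_coeff * seed_align
```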

9. The Thatcher Feature

Why does the Thatcher feature treat her name so inconsistently, in terms of which tokens it fires strongly or weakly on?

10. The Lincoln Feature

The Lincoln feature is remarkably clean and clearly shows how smoothly the model handles different tokenizations.

11. The Rwanda Feature

It feels a bit implausible to me that this is genuinely a Rwanda feature if it doesn’t fire on “Rwanda” at all.

12. The Los Angeles Feature

323 and 213 are apparently LA area codes. Why does the feature fire more strongly on later tokens, which aren’t LA-specific? Similarly with the URLs.

13. Activation Strength vs Attribution

only three out of the ten most strongly active features are among the ten features with highest ablation effect. In comparison, eight out of the ten most strongly attributed features are among the ten features with highest ablation effect.
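
As I understand it, attribution here is the linear approximation of the ablation effect: the feature’s activation times the gradient of the target logit with respect to that activation. A sketch under that reading (the function and tensor names are mine):

```python
import torch

def feature_attributions(acts: torch.Tensor, resid: torch.Tensor,
                         W_dec: torch.Tensor, target_logit: torch.Tensor) -> torch.Tensor:
    """Attribution of each SAE feature to a target logit at one token position:
    attribution_i = activation_i * (d target_logit / d activation_i),
    computed via the gradient at the residual stream dotted with the decoder row.

    acts:  [n_features] feature activations (detached values are fine).
    resid: [d_model] residual-stream vector with requires_grad=True, through
           which target_logit was computed.
    W_dec: [n_features, d_model] decoder directions.
    """
    (grad,) = torch.autograd.grad(target_logit, resid, retain_graph=True)
    return acts * (W_dec @ grad)  # [n_features]

# Ranking features by |attribution| rather than raw activation is what makes
# the second comparison in the quote above line up better with ablation effects.
```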

14. Other Takes

  • EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

  • eggsyntax’s Shortform