Do sparse autoencoders find “true features”?

Demian Till, 22 Feb 2024 18:06 UTC

Thanks to Joseph Bloom and James Oldfield for giving feedback on drafts, which helped improve the post.

In this post I’ll discuss an apparent limitation of sparse autoencoders (SAEs) in their current formulation as they are applied to discovering the latent features within AI models such as transformer-based LLMs. In brief, I’ll cover the following:

I’ll argue that the L1 regularisation used to promote sparsity when training SAEs may cause neurons in the sparse layer to learn to represent common combinations of features rather than the individual features that we want them to discover
I’ll also argue that, as well as making it harder to understand what the actual latent features are, this limitation may result in some less common latent features not being discovered at all, not even within combinations
I’ll then explain why I think that the phenomenon of feature splitting observed in Anthropic’s SAE paper appears to demonstrate that this limitation does indeed have a large impact on the features discovered by SAEs
Finally I’ll propose an approach for overcoming this limitation and discuss how we can test whether it really brings us closer to finding the real latent features

Rough definition of “true features”

We intend for SAEs to discover the “true features” (a term I’m borrowing from Anthropic’s SAE paper) used by the target model, e.g. a transformer-based LLM. There isn’t a universally accepted definition of what “true features” are, but for now I’ll use the term somewhat loosely to refer to something like:

linear directions in an activation space at a hidden layer within a target model which encode some reasonably monosemantic quantity such as the model’s “confidence” in some concept being in play
they should play a causal role in the functioning of the target model. So for example if we were to activate or deactivate the feature while the target model is processing a given input sequence then we should expect the outputs to change accordingly in some reasonably understandable way
they should be in their most atomic form, so that e.g. an arbitrary linear combination of two “true feature” directions is not necessarily itself a “true feature” direction even though it may satisfy the previous criteria

There may be other ways of thinking about features but this should give us enough to work with for our current purposes.

Why SAEs are incentivised to discover combinations of features rather than individual features

Consider a toy setup where one of the hidden layers in the target model has 3 “true features” represented by the following directions in its activation space:

Additionally, suppose that feature 1 and feature 2 occur far more frequently than feature 3, and that all features can potentially co-occur in a given activation vector. For the sake of simplicity let’s also suppose for now that when features 1 & 2 occur together they tend to both activate with some roughly fixed proportions. For example, an activation vector in which both features 1 and 2 are present (but not feature 3) might look like the following:

Now suppose we train an SAE with 3 neurons in the sparse layer on activation vectors from this hidden layer such as the one above. The desirable outcome is that each of the 3 neurons in the sparse layer learns one of the 3 “true features”. If this happens then the directions learnt by the SAE would mirror the directions of the “true features” in the target model, looking something like:

However depending on the respective frequencies of feature 3 vs features 1 & 2, as well as the value of the L1 regularisation weight, I will argue shortly that what may happen is that two of the neurons learn to detect when each of features 1 & 2 respectively occur by themselves, while the third neuron learns to detect when they both occur together. In this case the directions discovered by the SAE would look something like:

Note that for clarity I’m assuming that the SAE is trained with untied encoder/decoder weights, and that it is the decoder weights rather than the encoder weights which would contain these directions. In this example this is because the encoder would need inhibitory weights to e.g. prevent neuron 1 from activating when both features 1 & 2 are present, as we will discuss shortly. Also see Neel Nanda’s discussion and findings on SAE encoder/decoder weight tying.

There are two problems with the directions hypothetically learnt by the SAE here. One problem is that feature 3 hasn’t been represented at all, so we wouldn’t know anything about that feature from this SAE. The second problem is that one of the neurons has learnt a combination of features, which may confuse us in our attempts to understand what the “true features” are. If we tried to interpret what causes neuron 3 to activate, we may still get what seems like a reasonable human-understandable interpretation, and it isn’t clear how we could tell which neurons correspond to individual “true features” and which correspond to combinations thereof. If there is a way to tell, it would require additional effort/machinery.

Now let’s discuss why the SAE might learn these directions instead of the ones corresponding to the 3 “true features”. The key point is that by learning these directions it is able to achieve greater sparsity on average. If it learnt the directions we want it to learn then when features 1 & 2 occur together the SAE would need to activate both of the two neurons corresponding to these features:

However if neuron 3 learns the combination of features 1&2 then the SAE would only need to activate neuron 3, thus achieving greater sparsity:

Note that the encoder for these neurons would need to learn bias thresholds and, in the case of neurons 1 & 2, inhibitory weights, such that each neuron only activates when its specific individual feature or combination is active. There wouldn’t be any sparsity gain if all 3 neurons activated when both features 1 & 2 are present.
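The sparsity gain can be checked with a small numerical sketch. Everything specific here is an illustrative assumption rather than a value taken from the original figures: the three “true feature” directions are taken to be orthonormal, decoder directions are constrained to unit norm, features 1 & 2 both activate at level 1, and the shrinkage that an L1 penalty applies to activations is ignored.

```python
import math

# Hypothetical "true feature" directions (assumed orthonormal for illustration).
F1, F2, F3 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)


def normalise(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def reconstruct(decoder, acts):
    """Decoder output: the activation-weighted sum of decoder directions."""
    dim = len(decoder[0])
    return tuple(sum(a * d[i] for a, d in zip(acts, decoder)) for i in range(dim))


def l1(acts):
    """L1 norm of the sparse-layer activations (the sparsity penalty)."""
    return sum(abs(a) for a in acts)


# An activation vector in which features 1 and 2 are both present at level 1.
x = (1.0, 1.0, 0.0)

# Dictionary A: one neuron per "true feature"; two neurons must fire.
dict_true = [F1, F2, F3]
acts_true = [1.0, 1.0, 0.0]

# Dictionary B: neuron 3 has learnt the (unit-norm) combination of features
# 1 & 2 instead of feature 3; only neuron 3 needs to fire.
dict_combo = [F1, F2, normalise((1.0, 1.0, 0.0))]
acts_combo = [0.0, 0.0, math.sqrt(2.0)]

print(reconstruct(dict_true, acts_true), l1(acts_true))
print(reconstruct(dict_combo, acts_combo), l1(acts_combo))
```

Under these assumptions both dictionaries reconstruct `x` (up to floating-point error), but the L1 penalty is 2 for dictionary A and about 1.41 for dictionary B, which is the incentive described above.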

The most obvious counterargument here is that failing to learn a neuron corresponding to feature 3 comes with the cost of increased reconstruction error when feature 3 is present. However, if feature 3 is sufficiently rare compared with the combination of features 1 & 2 and the L1 regularisation weight is sufficiently high, then the increased reconstruction error will be outweighed by the gain in sparsity. If we decrease the L1 regularisation weight then we may get feature 3 represented, but we may also lose the sparse representations we’re after, leaving us back at square one with polysemantic neurons. It isn’t clear that there should exist a sweet spot for the L1 weight, and as we’ll discuss shortly, the ubiquity of feature splitting found in Anthropic’s SAE paper despite trying different L1 weights suggests there may not be an ideal sweet spot in practical settings. Any value for the L1 weight would likely be a compromise between learning monosemantic neurons and avoiding learning feature combinations.
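This trade-off can be made explicit with a back-of-the-envelope calculation. The sketch below is a simplification under stated assumptions, not a model of a real SAE: features are orthonormal and activate at level 1, decoder directions have unit norm, the loss is squared reconstruction error plus the L1 weight times the L1 norm of the activations, feature 3 only ever occurs on its own, the encoder is ideal, and L1 shrinkage is ignored. The function name and the event probabilities are hypothetical.

```python
import math


def expected_losses(p1_only, p2_only, p12, p3, l1_weight):
    """Expected loss per activation vector for two 3-neuron dictionaries.

    p1_only, p2_only: probability that feature 1 (or 2) occurs alone
    p12: probability that features 1 & 2 occur together
    p3: probability that feature 3 occurs (assumed to occur alone)
    """
    # Dictionary A: one neuron per "true feature". Reconstruction is exact,
    # so the loss is purely the L1 penalty (2 active neurons when 1 & 2 co-occur).
    loss_true = l1_weight * (p1_only + p2_only + 2.0 * p12 + p3)

    # Dictionary B: neurons for feature 1, feature 2 and their unit-norm
    # combination. The combination neuron fires at sqrt(2) when 1 & 2 co-occur;
    # feature 3 is not represented, costing squared error 1 whenever it occurs.
    loss_combo = p3 * 1.0 + l1_weight * (p1_only + p2_only + math.sqrt(2.0) * p12)

    return loss_true, loss_combo


# Feature 3 rare: the combination dictionary has the lower expected loss.
print(expected_losses(0.2, 0.2, 0.3, 0.01, l1_weight=0.1))
# Feature 3 less rare: representing feature 3 becomes worthwhile again.
print(expected_losses(0.2, 0.2, 0.3, 0.05, l1_weight=0.1))
```

Rearranging the two expressions, the combination dictionary wins under these assumptions whenever `p3 * (1 - l1_weight) < l1_weight * p12 * (2 - sqrt(2))`, so either a rarer feature 3 or a larger L1 weight tips the balance towards learning the combination.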

Another counterargument is that we can simply increase the number of neurons in the sparse layer in order to capture rarer features as well as combinations of more common features. In this toy example, if we increase the width of the sparse layer to 4 neurons then feature 3 would presumably be represented. That is, if we maintain the assumption that features 1 & 2 always co-activate with roughly the same proportions, allowing the single combination neuron to capture these co-occurrences. In this case our learnt directions might look something like:

However if we relax this assumption, again depending on frequencies of feature activations, our SAE may end up using the extra capacity in the sparse layer to learn additional combination directions capturing differing proportions of activation levels of features 1 & 2. In this case the feature directions learnt by the SAE could look something like:

In this case the extra capacity is used to represent more common combinations at the expense of representing feature 3. With a realistic target model there would be many more “true features”, many subsets of which would likely frequently co-occur with varying activation levels, which could result in a huge number of feature combination directions. Some of these combination directions will likely be sufficiently common compared with rarer individual features as to be prioritised by the SAE over said rarer individual features. But even if we increased the sparse layer width enough to capture all of these common feature combination directions along with all of the less common individual “true features”, we would still face the problem of being swamped with feature combinations with often only subtly varying interpretations, making it more difficult to find the “true features”.

Relation to feature splitting

We’ll now look at some findings from Anthropic’s SAE paper and discuss how they may be explained using the framework we’ve been developing. A surprising finding was that the “features” discovered by the SAEs they trained were often aligned with one another both conceptually and geometrically. I’m using “features” here to refer to interpretable directions found by the SAE which may or may not correspond to “true features” in the target model. For example two of the “features” they found had the following interpretations:

“the token ‘the’ in mathematical prose”
“the token ‘the’ in physics writing”

As well as being similar conceptually, they found that such “features” would also correspond to similar directions in activation space.

They found that this sort of phenomenon was ubiquitous in the “features” discovered by the SAEs they trained, with the “features” forming clusters of conceptual and geometric relatedness. Furthermore, they found that as they increased the width of the SAEs, the clusters would become more densely populated with increasingly nuanced distinctions between them.

They suggest that as they increase the width of the SAE, they may be converging on the “true features” which are represented in the target model. They hypothesise that these “true features” are even more densely packed and nuanced than the “features” they’ve found so far, and that the “features” they’ve found provide a sort of conceptual “summary” of the “true features”. The narrower the SAE, the more coarse grained the summary. For example in the 512-width SAE they found a neuron which seemed to correspond to “‘the’ in mathematical prose”, and in the 16,384 width SAE, they found one neuron which seemed to correspond to “‘the’ in the context of mathematics, especially complex analysis” and another neuron which seemed to correspond to “‘the’ in the context of mathematics, especially topology and abstract algebra”. Notice how the neuron in the 512 width SAE provides a sort of summary for the 2 neurons in the 16,384 width SAE, and those may in turn provide summaries for the even more nuanced “true features”.

However, my theory is that this phenomenon is unlikely to be a reflection of the nature of the “true features”. Rather, this phenomenon seems to be in line with what our earlier analysis would predict as a result of SAEs finding feature combinations and having more capacity to do so as the width is increased. Suppose the “true features” in a given layer of a target model include ones with interpretations along the lines of:

“the token ‘the’”
“the current context is mathematical prose”
“the current context is physics writing”

And suppose that the directions for these features are close to orthogonal. Note that they would be very unlikely to be completely orthogonal due to superposition, but we might expect the target model to attempt to make them as close to orthogonal as possible to minimise interference between features. So the “true feature” directions might look something like this:

Of course in reality the activation space would be far higher dimensional and there would be many more “true features”.

Suppose that activation vectors from this layer frequently exhibit “the token ‘the’” along with one (but not both) of the other “true features” being active. Then based on our earlier reasoning, we might expect some of the directions learnt by an SAE trained on activations from this layer to look something like:

Here we have 3 neurons that have learnt the individual feature directions + two more that have learnt commonly co-occurring combinations.

Note that they actually found hundreds of different “features” of the form “‘the’ in the context of []” for different contexts. Thus another possibility is that the “features” learnt by the SAE could look something like:

This would be possible if the SAE has learnt enough feature combinations so as to cover all of the combinations that the constituent individual “true features” are likely to appear in. This would preclude the need to learn the individual “true features” since all or most appearances of these features could be represented by one of the combination neurons.

Notice that since the feature combinations above share the feature “the token ‘the’” as a component, they are aligned both conceptually and geometrically, as was the case with the ‘features’ discovered in the paper.
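The geometric alignment follows directly from the shared component. As a minimal sketch, assume (hypothetically) that the three “true features” are exactly orthogonal unit vectors in a 3-dimensional space; real directions would be higher dimensional and only approximately orthogonal.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def add(u, v):
    """Element-wise sum of two vectors."""
    return tuple(a + b for a, b in zip(u, v))


# Hypothetical orthogonal "true feature" directions.
the_token = (1.0, 0.0, 0.0)  # "the token 'the'"
maths = (0.0, 1.0, 0.0)      # "the current context is mathematical prose"
physics = (0.0, 0.0, 1.0)    # "the current context is physics writing"

# Combination directions an SAE might learn instead.
the_in_maths = add(the_token, maths)
the_in_physics = add(the_token, physics)

print(cosine(maths, physics))                # orthogonal constituents
print(cosine(the_in_maths, the_in_physics))  # combinations sharing 'the'
```

With these assumed directions the two context features have cosine similarity 0, while the two combination directions have cosine similarity of about 0.5, purely because they share the “the token ‘the’” component.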

If we were to try to interpret neurons 1&2 without knowing the “true features” in the target model, just by looking at what activation vectors cause the neurons to activate, we might come up with interpretations similar to ones from the paper such as:

“the token ‘the’ in mathematical prose”
“the token ‘the’ in physics writing”

As for why the SAE features form clusters, one possibility is that this is a result of the existence of clusters of “true features” which tend to occur together in the same activation vectors. Perhaps there is a property of the data used to prompt the transformer target model whereby input sequences tend to pertain to a certain topic, and each topic has a set of features which are more likely to occur in that topic. The SAE is then more likely to learn feature combinations for these commonly co-occurring clusters of features.

Now let’s explore why the “features” seem to “split”, becoming increasingly specific. A narrow SAE could use its limited capacity to capture the most commonly occurring combinations along with the individual features. For example, if the “true features” include the following:

“the token ‘the’”
“the current context is mathematical prose”
“the current topic is topology”
“the current topic is abstract algebra”

then we might expect a relatively narrow SAE to capture the combination, potentially along with capturing the individual features:

“the token ‘the’” AND “the current context is mathematical prose”

since this covers all mathematical prose, including topology and abstract algebra, and would thus likely occur more frequently than either of the sub-topics.

But a wider SAE might use its extra capacity to capture the less common combinations:

“the token ‘the’” AND “the current context is mathematical prose” AND “the current topic is topology”
“the token ‘the’” AND “the current context is mathematical prose” AND “the current topic is abstract algebra”

This could explain why features tend to get increasingly specific as the width increases and why features in narrower SAEs can be seen as summarising those found by wider SAEs.

Proposed solution

A naive idea for a solution to SAEs learning feature combinations could involve trying to adjust the width of the SAE to roughly match the number of “true features”. One issue here is that we don’t currently have any way to know how many “true features” we are trying to find. Even if we somehow knew the number of “true features”, this approach would be unlikely to work due to the issue discussed earlier where some feature combinations may occur sufficiently frequently such that the sparsity gained by learning those combinations outweighs the reconstruction error incurred by not representing some of the rarer features.

Another idea could be to try tuning the sparse regularisation weight $λ$ to avoid incentivising learning combinations of features. But reducing $λ$ reduces the penalty for representing features by arbitrary directions in the SAE rather than individual neurons. As discussed earlier, it isn’t clear that there ought to exist a sweet spot for $λ$ to achieve sufficient sparsity while avoiding learning feature combinations instead of individual features. The ubiquity of the phenomenon of feature splitting observed in Anthropic’s SAE paper despite trying a range of values for $λ$ suggests that such a sweet spot isn’t likely to exist.

I propose including an additional regularisation term in the SAE loss to penalise geometric non-orthogonality of the feature directions discovered by the SAE. One way to formalise this loss could be as the sum of the absolute values of the cosine similarities between each pair of feature directions discovered in the activation space. Neel Nanda’s findings suggest that the decoder weights rather than the encoder weights are more likely to align with the feature directions, as the encoder’s goal is to detect the feature activations, which may involve compensating for interference with other features.
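A minimal sketch of this penalty, applied to decoder directions, might look as follows. The function names and the weight `ortho_weight` are hypothetical, and the combined loss assumes the usual form of reconstruction error plus $λ$ times the L1 norm of the activations; a real implementation would compute this on weight matrices in a deep learning framework rather than on Python tuples.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def orthogonality_penalty(decoder_directions):
    """Sum of |cosine similarity| over all unordered pairs of decoder directions."""
    return sum(abs(cosine(u, v)) for u, v in combinations(decoder_directions, 2))


def sae_loss(reconstruction_error, activations, decoder_directions,
             l1_weight, ortho_weight):
    """Reconstruction error + L1 sparsity term + orthogonality term."""
    sparsity = sum(abs(a) for a in activations)
    return (reconstruction_error
            + l1_weight * sparsity
            + ortho_weight * orthogonality_penalty(decoder_directions))


# Orthogonal directions incur no penalty ...
individual = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
# ... whereas a dictionary containing a combination direction does.
with_combo = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]

print(orthogonality_penalty(individual))
print(orthogonality_penalty(with_combo))
```

For the toy dictionaries above the penalty is 0 for the individual directions and about 1.41 for the dictionary containing the combination direction, so the extra term pushes against exactly the kind of solution that the L1 term rewards.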

If we train SAEs on the same target model both with and without this additional regularisation term we can then investigate whether geometric orthogonality really is a reasonable prior for the “true feature” directions. If it isn’t a reasonable prior then we would expect substantially worse reconstruction error. If it is a reasonable prior then we would expect it to discover features that weren’t discovered by the vanilla SAE because it used up its capacity on learning feature combinations.

Another test could be to look for features which seem to correspond to the individual constituents of apparent composite features. For example, if the vanilla SAE discovered the feature “the token ‘the’ in the context of mathematical prose”, then we can look for features in the SAE with orthogonality regularisation which seem to correspond to the constituent concepts “the current token is ‘the’” and “the current context is mathematical prose”. An interesting question would then be whether the direction for the feature “the token ‘the’ in the context of mathematical prose” is approximately a linear combination of the directions for the features “the current token is ‘the’” and “the current context is mathematical prose”. If we consistently find for different groups of features that the feature directions obey the expected arithmetic, then it would seem reasonable to conclude that feature splitting is an artefact of SAEs rather than a reflection of the structure of the space of the “true features”.
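One way to check the “expected arithmetic” is a least-squares fit of the composite direction onto the two candidate constituent directions, followed by a look at how much of the composite is left unexplained. The sketch below is one possible formalisation, not a prescribed test: the function name is hypothetical, the example vectors are made up, and what counts as a small enough residual is left open.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_two_directions(target, d1, d2):
    """Least-squares fit target ~ c1*d1 + c2*d2.

    Returns (c1, c2, relative_residual), where relative_residual is the norm
    of the unexplained part of target divided by the norm of target.
    """
    # Solve the 2x2 normal equations with Cramer's rule.
    g11, g12, g22 = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    b1, b2 = dot(target, d1), dot(target, d2)
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        raise ValueError("d1 and d2 are (nearly) parallel")
    c1 = (b1 * g22 - b2 * g12) / det
    c2 = (b2 * g11 - b1 * g12) / det
    residual = [t - c1 * a - c2 * b for t, a, b in zip(target, d1, d2)]
    relative = math.sqrt(dot(residual, residual)) / math.sqrt(dot(target, target))
    return c1, c2, relative


# Made-up directions: a composite lying in the span of its constituents ...
the_dir, maths_dir = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
composite = (0.6, 0.8, 0.0)
print(fit_two_directions(composite, the_dir, maths_dir))

# ... versus an unrelated direction that the constituents cannot explain.
unrelated = (0.0, 0.0, 1.0)
print(fit_two_directions(unrelated, the_dir, maths_dir))
```

A relative residual near 0 means the composite is well described as a linear combination of the two constituents, while a value near 1 means they explain almost none of it.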

Depending on the weight of the orthogonality regularisation term, we may still find that some feature combinations are discovered. Setting the weight too high may hinder the discovery of “true features” if they are sometimes somewhat aligned with one another, while setting it too low would allow neurons to learn overlapping combinations of features, so this hyperparameter would require some tuning. It could also be beneficial to apply a non-linearity to the orthogonality loss to severely penalise more closely aligned feature directions whilst being more lenient on only slightly aligned feature directions, which could potentially be a more accurate reflection of the structure of the “true feature” directions.
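As one illustration of such a non-linearity, the absolute cosine similarity of each pair could be raised to a power greater than 1 before summing. The choice of a power function and the exponent of 4 are assumptions made purely for this sketch; other non-linearities would serve the same purpose.

```python
def penalised_alignment(cos_sim, exponent=4):
    """Hypothetical non-linear penalty for one pair: |cos| ** exponent.

    With exponent > 1, slightly aligned pairs are penalised far less, relative
    to closely aligned pairs, than under the plain absolute value.
    """
    return abs(cos_sim) ** exponent


slightly_aligned, closely_aligned = 0.1, 0.9

# Plain absolute value: the closely aligned pair costs 9 times as much.
print(abs(closely_aligned) / abs(slightly_aligned))
# Power penalty: the gap between the two pairs is far larger.
print(penalised_alignment(closely_aligned) / penalised_alignment(slightly_aligned))
```

This keeps a meaningful penalty on near-duplicate directions while leaving the small interference expected under superposition almost unpenalised.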

Note that a naive implementation of this method would become extremely computationally intensive as we apply SAEs to larger target models with more “true features” and more dimensions in their activation vectors. The naive implementation to compute the orthogonality regularisation term would involve comparing every discovered feature direction to every other discovered feature direction. This would have quadratic computational complexity in the number of features we are trying to discover. A potential approach to tackle this could be to aim to discover features in smaller batches. After each batch of discovered features finishes learning we could freeze them and only calculate the orthogonality regularisation within the next batch, as well as between the next batch and the frozen features. Importantly we wouldn’t need to apply the regularisation within the already discovered features.

Edit: Chris_Leong correctly pointed out in the comments below that this batching algorithm would still have quadratic complexity overall. I’m still thinking about potentially more efficient methods for implementing orthogonality regularisation. Something along the lines of neighbor lists as suggested by Charlie Steiner in the comments below could be a way forward. And perhaps having the already discovered ‘frozen’ features being stationary targets with the batching approach could be advantageous.
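The quadratic cost of the batched scheme can be seen by counting pairwise comparisons. The sketch below counts one evaluation of the penalty per batch, which is a simplifying assumption (in practice each batch would be trained for many steps), and the function names are hypothetical.

```python
def naive_pair_count(n_features):
    """Comparisons when every direction is compared with every other one."""
    return n_features * (n_features - 1) // 2


def batched_pair_count(n_features, batch_size):
    """Comparisons under the batched scheme, one penalty evaluation per batch.

    Each new batch is compared within itself and against all frozen features;
    frozen features are never compared with one another again.
    """
    total = 0
    frozen = 0
    while frozen < n_features:
        current = min(batch_size, n_features - frozen)
        total += current * (current - 1) // 2  # pairs within the new batch
        total += current * frozen              # new batch vs frozen features
        frozen += current
    return total


print(naive_pair_count(1000))
print(batched_pair_count(1000, batch_size=50))
```

Both counts come out the same, because every pair of features is still compared exactly once: batching lowers the cost of each individual step but not the overall quadratic growth in the number of features.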