TL;DR

Sparse autoencoders (SAEs) are a promising direction for automating mechanistic interpretability, but they are not without flaws. One known issue of the original sparse autoencoders is the feature suppression effect, which is caused by the conflict between the $L_{2}$ and $L_{1}$ losses and the unit norm constraint on the SAE decoder. In theory, this effect is more evident for inputs with high norms. Another observation is that training SAEs on multiple layers simultaneously results in inconsistent $L_{0}$ norms of the feature activations across layers: in some layers $L_{0}$ is on the scale of $10^{2}$ , while in others it is on the scale of $10^{1}$ . Moreover, the residual states that are fed to the SAEs for training also have different norms across layers. Hence, I argue that the current SAE architecture is not robust against inputs of varying norms, which is commonly the case in modern LLMs. In this post, I propose a modified SAE architecture, the Normalized Sparse Autoencoder (NSAE), and give a theoretical proof that it will not have the feature suppression problem. I then conducted experiments to verify the effectiveness of the proposed method, which showed that:

Feature suppression is suppressed in NSAEs
The normalization removed the correlation between layer mean input norm and $L_{0}$
The normalization makes $L_{1}$ agree with $L_{0}$ better

I then further investigated the learned feature dictionaries and identified 3 types of feature vectors: the correction vector, the pillar vector, and the direction vector. I then concluded this post with discussion on the limitations of NSAEs and gave my suggestions on future directions.

Introduction

Training Sparse Autoencoders (SAEs) on the residual states of pretrained models is a recently proposed method in mechanistic interpretability to tackle the problem of superposition. This method is scalable and unsupervised, making it promising for auto-interpretability research.

More specifically, an SAE contains an encoder and a decoder. It is trained to generate sparse feature activations from the original residual states of a source model through the encoder, and to reconstruct the residual state through the decoder. It is expected that, by training the SAE on a large set of activations while jointly optimizing a sparsity loss on the feature activations and an $L_{2}$ reconstruction loss, the model can learn to decompose residual states into monosemantic feature vectors that are more interpretable.

In this post, I identified a flaw in the original SAE implementation, namely inconsistency of the $L_{1}$ loss across layers, and proposed a method to mitigate this problem. With the new method, we can significantly decrease the correlation between the norm of the source model’s residual activations and the $L_{0}$ norm of the feature activations, making the training process more robust and controllable. The code is available on GitHub (notice that you should use the dev branch instead of others).

Motivations

Feature suppression is a known problem for SAEs. It originated from a conflict between the $L_{1}$ sparsity loss and the $L_{2}$ reconstruction loss, as the reconstruction’s norm is correlated with $L_{1}$ , and the SAE model learns to generate a reconstruction with smaller norm for a better $L_{1}$ loss. This is not desirable, as we would like the reconstruction to best correspond to the original input activations. Therefore, finding a way to disentangle the input norms from $L_{1}$ and $L_{2}$ is beneficial.

Also, in my personal experiments with training SAEs using this implementation from the AI Safety Foundation, I observed an inconsistency of the $L_{1}$ sparsity loss across layers:

Figure 1a. The $L_{1}$ loss of the activations in the layer indexed 1.

Figure 1b. The $L_{1}$ loss of the activations in the layer indexed 10.

The above two figures are the $L_{1}$ losses of two different layers from the same training run, but the scale of $L_{1}$ has a $10 \times$ difference.

Moreover, the sparsity measured by $L_{0}$ is also vastly different across layers:

Figure 2a. The $L_{0}$ norm of the activations in the layer indexed 1.

Figure 2b. The $L_{0}$ norm of the activations in the layer indexed 10.

I argue that this is also undesirable, as we introduced the $L_{1}$ coefficient $α$ in an attempt to control the balance between the $L_{1}$ and $L_{2}$ losses across layers. Ideally, $α$ should exert consistent control across layers, which is not the case in practice.

Moreover, there is an inconsistency of the norms of the source model’s residual states across layers. We can plot the distribution of residual states^[1] norms in GPT-2 small across layers:

Figure 3. The norm distribution of residual states in different layers of the residual stream of GPT-2 small during inference.

It is obvious that the mean and variance of the norms differ across layers.

This effect is common among LLMs, and we can find similar effects in more recent models like LLaMA-2 and Gemma:

Figure 4a. The norm distribution of residual states in different layers of the residual stream of LLaMA2-7B during inference.

This provides some evidence that the inconsistency of input norms might have caused the undesirable behaviors in SAEs. Thus, I will conduct a theoretical analysis in the next section to further illustrate this problem.

Theoretical Analysis

Definitions

With these observations in mind, let’s do a theoretical analysis of the SAE loss to see why they might have happened.

Formally, an SAE can be defined as the following:

$Encoder (x) = ReLU (W_{e} x + b_{e}) = c$

$Decoder (c) = W_{d} c$

$x^{'} = SAE (x) = Decoder (Encoder (x))$

We denote the output of the encoder as the feature activation $c$ .

The loss function for optimization is defined as

$L_{1} = | | c | |_{1}$

$L_{2} = | | x^{'} - x | |_{2}$

$L (x^{'}, x) = α L_{1} + L_{2} = α | | c | |_{1} + | | x^{'} - x | |_{2}$

where the $L_{1}$ coefficient $α \in R^{+}$ is a hyperparameter of the user’s choice and $| | \cdot | |_{p}$ is the $p$-norm of a given vector.

We set another hyperparameter expansion factor $k \in N^{+}$ and denote the source model’s residual dimension as $n$ . Then we can define $m = k n$ and we have $x, x^{'} \in R^{n}$ , $c, b_{e} \in R^{m}$ , $W_{e} \in R^{m \times n}$ , and $W_{d} \in R^{n \times m}$ .
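The definitions above can be sketched in a few lines of NumPy. This is only a minimal illustration under my own assumptions (toy dimensions, random weights, the function names `sae_forward` and `sae_loss`), not the implementation used for the experiments in this post.

```python
import numpy as np

def sae_forward(x, W_e, b_e, W_d):
    """Return (c, x_rec) for a single input vector x of shape (n,)."""
    c = np.maximum(W_e @ x + b_e, 0.0)  # Encoder: ReLU(W_e x + b_e), shape (m,)
    x_rec = W_d @ c                     # Decoder: W_d c, shape (n,)
    return c, x_rec

def sae_loss(x, c, x_rec, alpha):
    """alpha * ||c||_1 + ||x' - x||_2, following the loss as written above."""
    return alpha * np.abs(c).sum() + np.linalg.norm(x_rec - x)

rng = np.random.default_rng(0)
n, k = 8, 4                  # toy residual dimension and expansion factor (assumed)
m = k * n
W_e = rng.normal(size=(m, n)) / np.sqrt(n)
b_e = np.zeros(m)
W_d = rng.normal(size=(n, m))
W_d /= np.linalg.norm(W_d, axis=0, keepdims=True)  # unit-norm decoder columns

x = rng.normal(size=n)
c, x_rec = sae_forward(x, W_e, b_e, W_d)
loss = sae_loss(x, c, x_rec, alpha=1e-3)
```

The shapes match the text: $W_{e}$ is $m \times n$ , $W_{d}$ is $n \times m$ , and the decoder columns are rescaled to unit norm.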

In the original implementation, the authors constrained the decoder to have unit norm column vectors, so that during optimization the model cannot minimize the $L_{1}$ loss by increasing the column norms of the decoder while learning to generate dense feature activations with small $L_{1}$ . This design choice leads to a potential flaw in the method, which is discussed in a later section of this post.

The Effect of Input Norms on Feature Suppression

The authors who identified feature suppression have provided a nice theoretical analysis in the Feature Suppression section, but for the comprehensiveness of this post, I will conduct a similar analysis using the terms defined in this post.

We first consider the extreme case where an input $x$ has a feature activation $c$ that only has one positive entry $i$ , with all other entries equal to 0. Then we have $x^{'} = Decoder (c) = W_{d} c = c_{i} w_{d, i}$ where $w_{d, i}$ is the $i$ -th column vector of $W_{d}$ . Since the columns of $W_{d}$ have unit norm, we must have $| | x^{'} | |_{2} = | | c_{i} w_{d, i} | |_{2} = c_{i} = | | c | |_{1}$ .

More generally, I will show that when $c$ is sparse, we also have $| | c | |_{1} \approx | | x^{'} | |_{2}$ .

Define $I = \{i : c_{i} \neq 0\}$ as the index set of all nonzero entries in the feature activation. We assume that the feature vectors in the set $\{w_{d, i} : i \in I\}$ are (almost) mutually orthogonal^[2], that is, $\forall i, j \in I, i \neq j, w_{d, i} \cdot w_{d, j} \approx 0$ . By the constraint that the decoder columns have unit norm, that is, $\forall i \in I, w_{d, i} \cdot w_{d, i} = 1$ , we have

$| | x^{'} | |_{2} = \sqrt{x^{'} \cdot x^{'}} = \sqrt{\sum_{i \in I} c_{i} w_{d, i} \cdot \sum_{j \in I} c_{j} w_{d, j}} = \sqrt{\sum_{i \in I} \sum_{j \in I} c_{i} c_{j} w_{d, i} \cdot w_{d, j}} \approx \sqrt{\sum_{i \in I} c_{i}^{2}} = | | c | |_{2}$

In the case of sparse $c$ , we have $| | c | |_{1} ⪆ | | c | |_{2} \approx | | x^{'} | |_{2}$ .
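These norm relations are easy to check numerically. The sketch below idealizes the "almost orthogonal" assumption by using exactly orthonormal decoder columns (taken from a QR factorization); the dimensions and activation values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
# Idealized decoder: exactly orthonormal columns from a QR factorization.
W_d, _ = np.linalg.qr(rng.normal(size=(n, n)))

# Extreme case: a single active feature.
c_one = np.zeros(n)
c_one[3] = 2.5
rec_norm_one = np.linalg.norm(W_d @ c_one)   # equals c_i = ||c||_1

# Sparse case: three active features out of 16.
c_sparse = np.zeros(n)
c_sparse[[1, 5, 9]] = [2.0, 0.5, 0.25]
rec_norm = np.linalg.norm(W_d @ c_sparse)    # equals ||c||_2 for orthonormal columns
l1 = np.abs(c_sparse).sum()
l2 = np.linalg.norm(c_sparse)
print(rec_norm_one, rec_norm, l2, l1)
```

With real SAE dictionaries the columns are only approximately orthogonal, so the equalities become approximations.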

Then our loss function becomes the following:

$L = α | | x^{'} | |_{2} + | | x^{'} - x | |_{2}$

If we attempt to minimize this loss, there is always a tradeoff between the reconstruction accuracy and the norm of the reconstruction. In most cases, the model will learn to construct an $x^{'}$ that is close enough to $x$ but has a slightly smaller norm than $x$ , in order to achieve low losses in both terms.
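A one-dimensional toy case makes the shrinkage concrete. The sketch below assumes a single active feature whose unit-norm decoder direction is perfectly aligned with $x$ , and it assumes the reconstruction term is the squared error, which is a simplification of mine rather than the loss exactly as written above. Under these assumptions the loss as a function of the activation $s$ is $α s + (s - r)^{2}$ with $r = | | x | |_{2}$ , and its minimizer is $s = r - α / 2$ , which is strictly below $r$ .

```python
def toy_loss(s, r, alpha):
    """alpha * s + (s - r)^2: sparsity penalty plus squared reconstruction error."""
    return alpha * s + (s - r) ** 2

def best_activation(r, alpha, steps=100_000):
    """Grid search for the activation s in [0, r] that minimizes the toy loss."""
    grid = [r * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda s: toy_loss(s, r, alpha))

r, alpha = 10.0, 0.5          # hypothetical input norm and L1 coefficient
s_star = best_activation(r, alpha)
ratio = s_star / r            # reconstruction norm relative to input norm
print(s_star, ratio)
```

The optimizer prefers a reconstruction that is slightly too short, which is the feature suppression effect in miniature.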

The Effect of Input Norms on the Inconsistency of $L_{0}$ Across Layers

Here, we make the same assumption as before: when $c$ is sparse, we have $| | c | |_{1} \approx | | x^{'} | |_{2}$ .

For the $L_{2}$ term, it is easier to work with the squared norm, which expands as

$| | x^{'} - x | |_{2}^{2} = | | x^{'} | |_{2}^{2} + | | x | |_{2}^{2} - 2 x^{'} \cdot x = | | x^{'} | |_{2}^{2} (1 + \frac{| | x | |_{2}^{2}}{| | x^{'} | |_{2}^{2}} - \frac{2 x^{'} \cdot x}{| | x^{'} | |_{2}^{2}})$

At first glance, this might not be obvious, but if our reconstruction $x^{'}$ is similar enough to $x$ , we can take $| | x^{'} | |_{2} \approx | | x | |_{2}$ ^[3] and the equation simplifies to

$| | x^{'} - x | |_{2}^{2} \approx | | x | |_{2}^{2} (1 + 1 - 2 \frac{x^{'} \cdot x}{| | x | |_{2} \cdot | | x^{'} | |_{2}}) = 2 | | x | |_{2}^{2} (1 - \cos (x^{'}, x))$

Now we can rewrite our loss:

$L (x^{'}, x) \approx α | | x^{'} | |_{2} + 2 | | x^{'} | |_{2}^{2} (1 - \cos (x^{'}, x))$

Notice that if $1 - \cos (x^{'}, x)$ stays on a relatively fixed scale, then the first term scales with $| | x^{'} | |_{2}$ while the second term scales with $| | x^{'} | |_{2}^{2}$ . Then, given a fixed $α$ , a larger $| | x | |_{2}$ biases the loss towards the second term. This agrees with my earlier observation: the source model’s residual states in deeper layers have larger norms than those in shallower layers, and the $L_{1}$ loss was significantly higher in deeper layers because the loss was dominated by the larger $L_{2}$ term.
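To see the scaling argument in numbers, the sketch below evaluates the two terms of the rewritten loss for a few norms. The values of $α$ , the cosine similarity, and the norms are hypothetical and chosen only for illustration.

```python
def loss_terms(norm, alpha, cos_sim):
    """Return the (sparsity, reconstruction) terms of the rewritten loss."""
    sparsity_term = alpha * norm                           # scales with the norm
    reconstruction_term = 2.0 * norm ** 2 * (1.0 - cos_sim)  # scales with norm^2
    return sparsity_term, reconstruction_term

alpha, cos_sim = 1e-3, 0.99   # hypothetical values
for norm in (10.0, 100.0, 1000.0):
    l1_term, l2_term = loss_terms(norm, alpha, cos_sim)
    print(f"norm={norm:7.1f}  reconstruction/sparsity ratio={l2_term / l1_term:.1f}")
```

Because the ratio of the two terms is $2 | | x^{'} | |_{2} (1 - \cos (x^{'}, x)) / α$ , it grows linearly with the norm, so a fixed $α$ weighs the sparsity term less in layers with larger activations.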

Normalizing SAEs

After such analysis, it is natural for us to ask: is there a way to solve these problems?

My answer is yes!

Here, I propose an architectural modification to the original SAE architecture, which I have named the Normalized Sparse Autoencoder (NSAE).

Architecture

The modified architecture is defined as the following:

$c = tanh (ReLU (W_{e} x + b_{e} + ϵ))$

$NSAE (x) = W_{d} c$

In this definition, $c$ is the new feature activation, and $W_{d}$ is no longer constrained to unit norm. A Gaussian error term $ϵ$ is introduced to regularize the feature activation, which is sampled from $N (0, σ)$ for some hyperparameter $σ$ .
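A minimal NumPy sketch of this forward pass is given below. The function name `nsae_forward`, the toy dimensions, and the choice to add noise only when a random generator is passed in (as a stand-in for training mode) are my assumptions; $σ$ is treated as the standard deviation of the noise.

```python
import numpy as np

def nsae_forward(x, W_e, b_e, W_d, sigma, rng=None):
    """Return (c, x_rec) with c = tanh(ReLU(W_e x + b_e + eps))."""
    pre = W_e @ x + b_e
    if rng is not None:
        # Assumed training mode: add Gaussian noise with standard deviation sigma.
        pre = pre + rng.normal(0.0, sigma, size=pre.shape)
    # Entries lie in [0, 1) mathematically; float rounding can yield exactly 1.0.
    c = np.tanh(np.maximum(pre, 0.0))
    return c, W_d @ c   # W_d is not constrained to unit-norm columns here

rng = np.random.default_rng(0)
n, m = 8, 32
W_e = rng.normal(size=(m, n))
b_e = np.zeros(m)
W_d = rng.normal(size=(n, m))
x = rng.normal(size=n)

c_small, _ = nsae_forward(x, W_e, b_e, W_d, sigma=1.0, rng=rng)
c_large, _ = nsae_forward(100.0 * x, W_e, b_e, W_d, sigma=1.0, rng=rng)
print(c_small.max(), c_large.max())
```

Scaling the input by a factor of 100 leaves every entry of $c$ bounded by 1, which is the property the architecture relies on.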

The introduction of tanh normalizes every entry of $c$ to the range $[0, 1)$ . The benefits of doing this are threefold:

This makes $c$ independent of the norm of the input, hence theoretically prevents feature suppression.
When the entries of $c$ are in the range of $[0, 1)$ , $| | c | |_{0}$ and $| | c | |_{1}$ are much closer, making the $L_{1}$ loss a more accurate measure of sparsity.
The decoder learns features with norms, which can potentially lead to better interpretability, as we can now consider both directions and norms.

The Gaussian noise term is also essential in this architecture. Without it, the model can learn to minimize $L_{1}$ by learning to map to very small positive values in the feature activation space and learn decoders with extremely large column norms.

To show why adding Gaussian noise solves this problem, I plotted the activation in the following figure:

Figure 5. The tanh(ReLU(x)) function and the output ranges that different input ranges map to. For large inputs, the input range maps to a very small region on the y-axis, meaning that perturbations in that range cause almost no change in the output, while smaller inputs are much more sensitive to perturbation.

From the figure, we can see that when the inputs are small, the output of tanh(ReLU) is relatively sensitive to the input, so adding Gaussian noise can significantly perturb small feature activations. In contrast, larger inputs to the activation function are much more robust to perturbation, as they all map to similar values close to $1$ . Hence, this perturbation forces the model to learn to generate feature activations that are either strictly 0 or close to 1, which makes $| | c | |_{1}$ behave even more like $| | c | |_{0}$ , especially when we set $σ$ to be large.
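The sensitivity difference can be checked directly. The sketch below perturbs a small and a large pre-activation by the same amount and compares how much tanh(ReLU) changes; the specific input values and perturbation size are arbitrary.

```python
import math

def act(z):
    """tanh(ReLU(z)) for a scalar pre-activation z."""
    return math.tanh(max(z, 0.0))

def output_shift(z, delta):
    """Absolute change in tanh(ReLU(z)) when z is perturbed by delta."""
    return abs(act(z + delta) - act(z))

delta = 0.5
small = output_shift(0.1, delta)   # pre-activation near zero
large = output_shift(4.0, delta)   # pre-activation in the saturated region
print(small, large)
```

The same perturbation moves the small activation by orders of magnitude more than the saturated one, which is why the noise penalizes small positive activations but barely affects activations near 1.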

Loss

We also have to redefine the loss as follows:

$X_{i} = \{x : x \text{ is in the } i\text{-th layer of the input batch}\}$

$β_{i} = α \cdot mean (\{| | x | |_{2} : x \in X_{i}\})^{2}$

$L_{i} (x^{'}, x) = β_{i} | | c | |_{1} + | | x^{'} - x | |_{2}$

We introduced the additional step of scaling $α$ by the square of the mean input norm of each layer. This is because the squared reconstruction error satisfies $| | x^{'} - x | |_{2}^{2} \approx 2 | | x^{'} | |_{2}^{2} (1 - \cos (x^{'}, x))$ . If we assume that the best an optimizer can do is to achieve a fixed cosine similarity between $x$ and $x^{'}$ without the $L_{1}$ constraint, then we can treat the $(1 - \cos (x^{'}, x))$ term as a constant, so the $L_{2}$ loss is on the scale of $| | x^{'} | |_{2}^{2} \approx | | x | |_{2}^{2}$ , while $| | c | |_{1} \approx | | c | |_{0}$ , which should be constant across layers. Therefore, we can manually scale the $L_{1}$ loss to match the scale of the $L_{2}$ loss. Another way to scale the loss is to use the actual $| | x | |_{2}^{2}$ of each sample. Theoretically, this might cause the model to overfit to inputs with large norms, but to keep this post concise, I leave this question for future work and use only the mean normalization in all of the following experiments.
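The per-layer scaling can be sketched as follows. The function names, the batch layout (one row per activation), and the batch-mean reduction are assumptions of mine; the reconstruction term follows $L_{i}$ as written above.

```python
import numpy as np

def layer_beta(X_i, alpha):
    """beta_i = alpha * (mean input norm of layer i)^2; X_i has shape (batch, n)."""
    mean_norm = np.linalg.norm(X_i, axis=1).mean()
    return alpha * mean_norm ** 2

def nsae_layer_loss(X_i, C_i, X_rec_i, alpha):
    """Batch mean of beta_i * ||c||_1 + ||x' - x||_2 for one layer."""
    beta = layer_beta(X_i, alpha)
    l1 = np.abs(C_i).sum(axis=1)                 # per-sample ||c||_1
    l2 = np.linalg.norm(X_rec_i - X_i, axis=1)   # per-sample reconstruction norm
    return float((beta * l1 + l2).mean())

# Toy check: inputs that are 10x larger give a 100x larger L1 coefficient.
X = np.ones((4, 9))                              # every row has norm 3
print(layer_beta(X, alpha=0.5), layer_beta(10.0 * X, alpha=0.5))
```

The alternative mentioned in the text, scaling by each sample's own $| | x | |_{2}^{2}$ , would replace `mean_norm ** 2` with a per-row value.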

Experiments

I trained two groups of SAEs, one baseline and one experiment, on all layers of GPT-2, and each group contains 2 training runs trained on 100,000,000 activations. These four runs used different $L_{1}$ coefficients and learning rates, and the baseline used the original SAE while the experiment used the normalized SAE. I will use “the experiment group” and “the normalized group” interchangeably.

Feature Suppression is Suppressed in Normalized SAE

To investigate feature suppression, I added a new verification metric that measures the ratio between the norm of reconstructions and norm of source activations. Here is this measure during training:

Figure 6. Mean feature suppression ( $| | x^{'} | |_{2} / | | x | |_{2}$ ) during training, higher is better.

Clearly, the normalized group scores significantly higher on this metric than the baseline group, and its score is very close to one. Considering that this NSAE had not fully converged, as it only went through 200M training examples, and that there is no sign of the score flattening, I claim that NSAEs have little to no feature suppression.
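The norm-ratio metric used above can be sketched as below. Whether the original metric averages per-sample ratios or divides mean norms is not stated, so the per-sample average here is an assumption, as is the function name.

```python
import numpy as np

def norm_ratio(X, X_rec):
    """Mean of ||x'||_2 / ||x||_2 over a batch; rows are individual activations."""
    ratios = np.linalg.norm(X_rec, axis=1) / np.linalg.norm(X, axis=1)
    return float(ratios.mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
# A reconstruction that is uniformly 10% too short gives a ratio of about 0.9.
print(norm_ratio(X, 0.9 * X))
```

A value below one indicates that reconstructions are shorter than the inputs, and a value near one indicates little suppression.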

Normalizing $L_{1}$ Removes the Correlation Between Input Norm and $L_{0}$

To investigate the effect of normalization, I collected the $L_{0}$ norms of different layers during the end of training and plotted them against the mean input norms of the layer:

Figure 7. The correlation between mean input norms and the mean $L_{0}$ norm of the feature activation.

The red and blue datapoints are from the baseline group whereas the cyan and purple datapoints are from the experiment. We can fit lines to these datapoints to find linear relationships between the mean input norm and the mean $L_{0}$ norm of the feature activations. Although the fit is not good, the fitted lines still show a rough positive linear correlation between the mean input norm and the feature activation $L_{0}$ norm in the baseline. In contrast, the two normalized runs did not exhibit a statistically significant positive linear relationship between input norm and $L_{0}$ .

This linear fit definitely does not look satisfactory, and I further investigated the reasons behind it. I plotted the normalized group’s $L_{0}$ against layer index, and here is what it looks like:

Figure 8. The correlation between layer and the mean $L_{0}$ norm of the feature activation.

I conjecture that $L_{0}$ in the normalized group reflects the level of discreteness of the activations of the source model, as it exhibits an increase-then-decrease pattern. In the source model, earlier activations are more discrete as they originate from discrete input embeddings, while deeper activations might be less discrete as they aggregate information. In the last layers, as the model has to make the next-token prediction as accurate as possible, the activations might become more discrete again for better next-token decoding, since the decoding layer is discrete. This discreteness might also be positively correlated with the monosemanticity of the activations, as more discrete activations are often more interpretable. I will not verify this conjecture in this post due to length considerations, and I welcome others to study this problem.

$L_{1}$ Agrees with $L_{0}$ Better

To investigate the agreement between $L_{1}$ and $L_{0}$ , I plot the mean $L_{0}$ and $L_{1}$ of the feature activations for both groups:

Figure 9. Agreement between $L_{0}$ and $L_{1}$ . The thing that matters is the distance between two lines of the same color.

Clearly, the cyan and purple solid lines (which are $L_{1}$ ) are much closer to their corresponding dashed lines ( $L_{0}$ ) than the baselines, indicating better agreement between $L_{0}$ and $L_{1}$ .

Performance Validation

To validate that the normalization did not heavily impact performance, I present the reconstruction score metric. I first calculate the loss of no intervention, zero intervention (replacing hidden states in one layer with zero vectors), and reconstruction intervention (replacing hidden states in one layer with reconstructed vectors from SAE), and I will denote them as $L_{clean}$ , $L_{zero}$ , and $L_{reconstruction}$ , respectively. Then, the score is calculated by

$S_{reconstruction} = \frac{L_{zero} - L_{reconstruction}}{L_{zero} - L_{clean}}$

Since we expect $L_{zero}$ to be higher than $L_{clean}$ and we want $L_{reconstruction}$ to be close to $L_{clean}$ , a higher score is better, and we expect a value close to $1$ . The score during training is shown below:

Figure 10. Mean reconstruction score during training.

There is no observable difference between the normalized group and the baseline group except that the normalized group’s score seems slightly more stable during training, indicating that the normalization did not heavily impact performance but might have improved training stability.
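For reference, the reconstruction score defined above is a one-line computation. The sketch below uses a hypothetical function name and made-up loss values purely to show the two endpoints of the scale.

```python
def reconstruction_score(loss_clean, loss_zero, loss_reconstruction):
    """(L_zero - L_reconstruction) / (L_zero - L_clean)."""
    denominator = loss_zero - loss_clean
    if denominator == 0:
        raise ValueError("L_zero and L_clean must differ")
    return (loss_zero - loss_reconstruction) / denominator

# Hypothetical loss values, for illustration only.
perfect = reconstruction_score(3.0, 10.0, 3.0)    # reconstruction matches clean loss
useless = reconstruction_score(3.0, 10.0, 10.0)   # no better than zero ablation
print(perfect, useless)
```

A score of 1 means that substituting the SAE reconstruction leaves the source model's loss unchanged, and a score of 0 means it is as harmful as zeroing the activations.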

Since the mean reconstruction score is heavily impacted by the sparsity of the feature activation, I also compared a layer where the $L_{0}$ of the baseline and experiment groups agree best with each other:

Figure 11a. $L_{0}$ norm of layer 5 for experiment and baseline.

Figure 11b. Reconstruction score of layer 5 for experiment and baseline.

Still, there is no observable difference between the experiment group and the baseline after convergence. This provides further evidence that the normalization did not have an observable negative impact on the performance of SAEs.

NSAE Statistics

To further investigate what the new SAE has learned, I did some statistical analysis on the NSAE feature dictionary from the first run. For comparison, I used the original SAE trained in the first baseline run.

I first analyzed the norm distribution of the feature vectors along the layers:

Interestingly, a large proportion of feature vectors have norms in the range of $(0, 0.5)$ , which might indicate that these are small correction vectors that are added to a bigger vector to make the prediction as close as possible. In contrast, I hypothesize that feature vectors with high norms should have good interpretability, as they represent the general directions of the reconstruction. Hence, I will call these vectors the pillar vectors.

Next, I calculate the distribution of cosine similarity of the feature dictionary:

From the figure, it’s obvious that the cosine similarity distributions of NSAE and SAE are very similar, except that in NSAE some cosine similarities are very close to one. My hypothesis is that, in NSAE, some direction vectors appear frequently with different norms in the decomposition of source model activations, so the NSAE has to learn several vectors with the same direction but different norms.

A natural question to ask is: do pillar vectors and direction vectors overlap? To answer this, I picked the top-$100$ vectors (in terms of norm) of each layer from the feature dictionary as a set of pillar vectors and calculated their cosine similarity, and here is the distribution:

Figure 14. Distribution of cosine similarity for high-norm feature vectors (pillar vectors)

Since there are few to no vectors with very high cosine similarity, there is minimal overlap between pillar vectors and direction vectors.
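The selection and comparison described above can be sketched for a single layer as follows. The function name is hypothetical, the decoder here is random rather than a trained dictionary, and how results were pooled across layers in the actual analysis is not specified, so this only illustrates the per-layer step.

```python
import numpy as np

def top_norm_cosines(W_d, top_k=100):
    """Pairwise cosine similarities among the top_k highest-norm decoder columns."""
    norms = np.linalg.norm(W_d, axis=0)
    top = np.argsort(norms)[-top_k:]      # indices of the largest-norm columns
    V = W_d[:, top] / norms[top]          # normalize the selected columns
    sims = V.T @ V                        # (top_k, top_k) cosine similarity matrix
    upper = np.triu_indices(len(top), k=1)
    return sims[upper]                    # each unordered pair once, no diagonal

rng = np.random.default_rng(0)
W_d = rng.normal(size=(32, 512))          # stand-in decoder: n=32, m=512
sims = top_norm_cosines(W_d, top_k=100)
print(sims.shape, float(np.abs(sims).max()))
```

Values near one in this array would indicate pillar vectors that share a direction, which is what the figure above checks for.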

As this post is already pretty long, I will leave a more comprehensive analysis on the learned feature dictionary to a future post and conclude this post.

Discussion

Limitations

The normalization did not come without cost. NSAEs generally have slightly higher reconstruction losses compared with the original, and it takes longer for NSAE to converge, as shown in the following figure:

Figure 15. L2 reconstruction loss during training, lower is better.

I suspect the reason is that NSAE learns a non-unit norm dictionary, and this dictionary has to capture all the norm information with a fixed size, whereas the original SAE can learn directions and add norm information through the feature activations.

Another metric that I don’t know how to interpret is the neuron activity. In NSAE, the neuron activity is significantly higher than in the original SAE:

Figure 16. Neuron activity for baseline and experiment groups.

Lastly, the experiments conducted are relatively small in scale due to limitations in compute. Moreover, due to the change of the loss function, it’s hard to directly match the scales of $L_{0}$ between the baseline and the experiment group.

Future Work

I suggest future work to go along the following directions:

Investigate other factors that might have caused the $L_{0}$ inconsistency across layers. I proposed a conjecture that the difference in discreteness of the source model’s activations across layers might have caused this inconsistency.
Interpret the learned feature dictionary of NSAE. Future work can further investigate the feature vectors, especially the pillar vectors and direction vectors, and find interpretations for them.

Appendix

Hyperparameters

I varied the hyperparameters l1_coefficient and the optimizer learning rate lr. For the two normalized runs, I also set the standard deviation of the Gaussian noise $σ$ .

	baseline 1	baseline 2	normalized 1	normalized 2
l1_coefficient	0.001073	0.0009642	0.00004065	0.0000965
lr	0.0006275	0.00005584	0.0009045	0.000657
$σ$	N/A	N/A	1	1

Table A1. Hyperparameters used for training that varied for different runs

expansion_factor	16
context_size	256
source_data_batch_size	16
train_batch_size	4096
max_activations	100,000,000
validation_frequency	5,000,000
max_store_size	100,000
resample_interval	200,000,000
n_activations_activity_collate	100,000,000
threshold_is_dead_portion_fires	1e-6
max_n_resamples	4
resample_dataset_size	100_000
cache_names	blocks.{layer}.hook_mlp_out

Table A2. Fixed hyperparameters for all runs

Related Work

Riggs et al. proposed to use Sparse Autoencoders (SAEs) to discover interpretable features in large language models. Later, Wright et al. identified the feature suppression effect in SAEs and argued that the $L_{1}$ loss induced smaller feature activations that harmed reconstruction performance. Wes Gurnee observed that the reconstruction errors in SAEs are empirically pathological, and compared different norm-aware interventions to the source model’s inference. The results show that replacing the original residual state with the SAE reconstruction significantly changed the model’s predictions, especially in deeper layers.

^[1] In this and the following examples, I used the residual states from the MLP layer.
^[2] This is a reasonable assumption, as data in Figure 13 (baseline) show that most feature vector pairs in the original sparse autoencoder have cosine similarities in the range of $(- 0.2, 0.2)$ .
^[3] Empirically, $| | x^{'} | |_{2} \approx 0.9 | | x | |_{2}$ , which is close enough for our analysis.
^[4] For computational efficiency, I randomly sampled 1,000,000 features from the cosine similarity matrix.
^[5] $L_{0}$ collected from step=3000. Input norm sampled from a relatively small sample of random text. This text is the same as the text used to generate figure 3, 4a, and 4b.