Sparsity is the enemy of feature extraction (ft. absorption)

Sparse Autoencoders (and other related feature extraction tools) often optimize for sparsity to extract human-interpretable latent representations from a model’s activation space. We show analytically that sparsity naturally leads to feature absorption in a simplified untied SAE, and discuss how this makes SAEs less trustworthy for AI safety use, along with some ongoing efforts to fix this. This might be obvious to people working in the field, but we ended up writing a proof sketch, so we’re putting it out here. Produced as part of the ML Alignment & Theory Scholars Program, Winter 2024-25 Cohort.

The dataset (a distribution with feature hierarchy)

In this proof, we consider a dataset with points sampled to exhibit features from a set of features $F = \{f_1, f_2, \dots\}$. In particular, we will consider two features $f_1$ and $f_2$ that follow the hierarchy $f_2 \Rightarrow f_1$ (think $f_2$ = elephant and $f_1$ = animal, for instance), where the existence of $f_2$ implies the existence of $f_1$.

Hierarchy in human-interpretable features is prevalent (and hard to study in LLMs). While other unrelated features still exist, for $f_1$ and $f_2$ we can partition the probability of this dataset into four combinations:


So these are the individual probabilities of a datapoint eliciting these combinations of features:

  • $p_{12} = P(f_1 \wedge f_2)$ (both features present, think elephant, which implies animal),

  • $p_1 = P(f_1 \wedge \neg f_2)$ (only $f_1$ present, think cat or dog),

  • $p_2 = P(\neg f_1 \wedge f_2)$ (only $f_2$ present, which should be zero, because of the hierarchy),

  • $p_0 = P(\neg f_1 \wedge \neg f_2)$ (neither feature present, maybe talking about volcanoes).

Each feature $f_i \in \mathbb{R}^d$ is a vector with unit norm, and we assume that all features are mutually orthogonal, so $f_i \cdot f_j = 0$ for $i \neq j$. Each activation $x$ in the model’s residual stream is a sum of all active features.
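As a concrete illustration, here is a minimal numpy sketch of such a toy dataset. The dimension, the probability values, and the helper name `sample_point` are our own illustrative choices, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical residual-stream dimension

# Two orthonormal feature directions: f1 = "animal" (parent), f2 = "elephant" (child).
f1 = np.eye(d_model)[0]
f2 = np.eye(d_model)[1]

# Illustrative probabilities for the feature combinations (p2 = 0 by hierarchy).
p12, p1, p0 = 0.2, 0.3, 0.5

def sample_point():
    """Sample an activation as the sum of the active features."""
    u = rng.random()
    if u < p12:            # elephant => both f1 and f2 are active
        return f1 + f2
    elif u < p12 + p1:     # some other animal => only f1 is active
        return f1
    else:                  # neither feature (e.g. talking about volcanoes)
        return np.zeros(d_model)

X = np.stack([sample_point() for _ in range(1000)])
```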

Training a Sparse Autoencoder (SAE)

Given a language model with residual stream activations $x \in \mathbb{R}^d$, the sparse autoencoder learns a mapping $x \mapsto h \mapsto \hat{x}$ such that $\hat{x}$ reconstructs $x$. The total loss consists of a reconstruction term $\mathcal{L}_{recon}$, which minimizes the squared error, and a sparsity term $\mathcal{L}_{sparsity}$, enforcing sparsity via an $L_1$-norm penalty on the latent activations $h$. The model parameters are optimized via gradient descent to minimize $\mathcal{L}$. The complete loss:

$\mathcal{L} = \mathcal{L}_{recon} + \lambda\,\mathcal{L}_{sparsity}$

where $\mathcal{L}_{recon} = \|x - \hat{x}\|_2^2$ (the reconstruction loss) and $\mathcal{L}_{sparsity} = \|h\|_1$ (the sparsity loss).

The SAE constructs latent activations as $h = \mathrm{ReLU}(W_{enc}\,x)$, where $W_{enc}$ is the encoder weight matrix.

We consider encoders and decoders without biases for simplicity. We look at the latents of the SAE that are related to the two features $f_1$ and $f_2$, and show that absorption doesn’t affect reconstruction while increasing sparsity, and that optimizing for sparsity pushes for higher absorption.
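For concreteness, here is a minimal sketch of this bias-free SAE forward pass and loss (the function name `sae_loss` and the default value of $\lambda$ are our own choices):

```python
import numpy as np

def sae_loss(x, W_enc, W_dec, lam=1e-2):
    """Bias-free SAE: h = ReLU(W_enc @ x), x_hat = W_dec @ h.
    Returns total loss = squared-error reconstruction + lam * L1 sparsity."""
    h = np.maximum(W_enc @ x, 0.0)      # latent activations
    x_hat = W_dec @ h                    # linear decoder
    recon = np.sum((x - x_hat) ** 2)     # L_recon
    sparsity = np.sum(np.abs(h))         # L_sparsity (L1 on latents)
    return recon + lam * sparsity, h, x_hat
```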

Evaluating the SAE loss under $m$-absorption

We use $h_i$ to denote the hidden activation for latent $i$, $\mathbf{e}_i$ to denote the SAE encoder direction for latent $i$, and $\mathbf{d}_i$ to denote the SAE decoder direction for latent $i$. We assume the decoder is linear, that is, the reconstruction is a linear combination of latents, $\hat{x} = \sum_i h_i\,\mathbf{d}_i$. We assume that the first two latents ($h_1$ and $h_2$) track $f_1$ and $f_2$, without loss of generality.

We define $m$-absorption as part-way between no absorption ($m = 0$) and full absorption ($m = 1$). Thus, $\mathbf{e}_1 = f_1 - m\,f_2$, $\mathbf{e}_2 = f_2$, $\mathbf{d}_1 = f_1$, and $\mathbf{d}_2 = f_2 + m\,f_1$, and we look at the reconstruction and sparsity losses under varying amounts of absorption.
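In code, this parameterization looks like the following sketch (the helper name `two_latent_sae` is our own, and it models only the two latents of interest):

```python
import numpy as np

def two_latent_sae(x, f1, f2, m):
    """The two latents under m-absorption: e1 = f1 - m*f2, e2 = f2 (encoder);
    d1 = f1, d2 = f2 + m*f1 (decoder). Returns (h1, h2) and the reconstruction."""
    e1, e2 = f1 - m * f2, f2
    d1, d2 = f1, f2 + m * f1
    h1, h2 = max(e1 @ x, 0.0), max(e2 @ x, 0.0)   # ReLU(e_i . x), no bias
    return (h1, h2), h1 * d1 + h2 * d2
```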

Reconstruction under $m$-absorption

Case 1: $x = f_1$ (parent feature only)

In this case, the parent feature fires on its own, so the SAE just needs to reconstruct $x = f_1$.

The first latent fires with magnitude 1, since $h_1 = \mathbf{e}_1 \cdot x = (f_1 - m\,f_2) \cdot f_1 = 1$: we have $f_1 \cdot f_1 = 1$ but $f_2 \cdot f_1 = 0$ since the features are orthogonal.

The second latent fires with magnitude 0 since $h_2 = \mathbf{e}_2 \cdot x = f_2 \cdot f_1 = 0$.

The decoder output is thus just $\hat{x} = h_1\,\mathbf{d}_1 + h_2\,\mathbf{d}_2 = 1 \cdot f_1 + 0 = f_1$, perfectly reconstructing the input.

Case 2: $x = f_1 + f_2$ (parent and child together)

In this case, the parent and child are firing together, so the SAE needs to reconstruct the sum of the parent and child features, $x = f_1 + f_2$.

The first latent fires with magnitude $1 - m$, since $f_1 \cdot f_1 = 1$ and $f_2 \cdot f_2 = 1$, but $f_1 \cdot f_2 = 0$. We have $h_1 = \mathbf{e}_1 \cdot x = (f_1 - m\,f_2) \cdot (f_1 + f_2) = 1 - m$.

The second latent fires with magnitude 1 since $f_2$ is present in $x$: $h_2 = \mathbf{e}_2 \cdot x = f_2 \cdot (f_1 + f_2) = 1$.

The decoder output sums to $\hat{x} = (1 - m)\,f_1 + 1 \cdot (f_2 + m\,f_1) = f_1 + f_2$, again perfectly reconstructing the input.

Case 3: neither $f_1$ nor $f_2$ present (nothing fires)

Here, $f_1 \cdot x = 0$ and $f_2 \cdot x = 0$.

When neither feature is present, neither latent fires: $h_1 = \mathbf{e}_1 \cdot x = 0$ and $h_2 = \mathbf{e}_2 \cdot x = 0$.

And thus the contribution of these two latents to the decoder output is still 0, achieving perfect reconstruction.

We do not need to consider the case where $f_2$ fires alone, as this is ruled out by the hierarchy in our setup. As we see above, in all cases, any level of absorption still achieves perfect reconstruction.
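As a quick numerical sanity check of the three cases (a sketch; the dimension and the grid of $m$ values are our own choices):

```python
import numpy as np

d = 16
f1, f2 = np.eye(d)[0], np.eye(d)[1]

for m in [0.0, 0.3, 0.7, 1.0]:
    e1, e2 = f1 - m * f2, f2                      # encoder under m-absorption
    d1, d2 = f1, f2 + m * f1                      # decoder under m-absorption
    for x in [f1, f1 + f2, np.zeros(d)]:          # cases 1, 2 and 3
        h1, h2 = max(e1 @ x, 0.0), max(e2 @ x, 0.0)
        assert np.allclose(h1 * d1 + h2 * d2, x)  # perfect reconstruction for any m
```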

Intuitively, absorption does not hinder reconstruction because knowing the elephant feature is active is enough to infer animal even if that feature gets absorbed.

Sparsity under $m$-absorption

We calculate the sparsity loss over the two latents as follows:

$\mathcal{L}_{sparsity} = \mathbb{E}_x\left[\,\|h_{1,2}(x)\|_q^q\,\right]$

where $h_{1,2}$ represents the two latents that are related to $f_1$ and $f_2$, and $q$ is the norm exponent (to disambiguate it from the probabilities $p$). Since the representation and sparsity of the unrelated features stays the same, we look at this value of sparsity and how it changes with the amount of absorption in our data distribution.

Case 1: $m = 0$ (no absorption)

With no absorption (which means the SAE learned the true features $f_1$ and $f_2$ in both the encoder and decoder), the sparsity comes out to be (with the $L_q$ norm, where we assume $q = 1$ for simplicity):

$\mathcal{L}_{sparsity}(0) = p_1 \cdot 1 + p_{12} \cdot 2 = p_1 + 2\,p_{12}$

Case 2: $m = 1$ (absolute absorption)

With absolute absorption of the parent feature into the child latent (on all datapoints exhibiting it), the decoder learns $\mathbf{d}_1 = f_1$ and $\mathbf{d}_2 = f_1 + f_2$, while the encoder learns $\mathbf{e}_1 = f_1 - f_2$ (exclusion due to absorption) and $\mathbf{e}_2 = f_2$. In this case, we get the following sparsity loss:

$\mathcal{L}_{sparsity}(1) = p_1 \cdot 1 + p_{12} \cdot 1 = p_1 + p_{12}$

Case 3: general $m$ (some amount of absorption)

Finally, with an arbitrary amount $m$ of absorption of $f_1$ into the $f_2$ latent, we get:

$\mathcal{L}_{sparsity}(m) = p_1 \cdot 1 + p_{12}\,\big((1 - m) + 1\big) = p_1 + (2 - m)\,p_{12}$
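A quick numerical check of this formula (a sketch; the probabilities and dimension are illustrative, and the neither-feature case contributes nothing so it is omitted):

```python
import numpy as np

d = 16
f1, f2 = np.eye(d)[0], np.eye(d)[1]
p12, p1 = 0.2, 0.3   # illustrative probabilities

def sparsity_loss(m):
    """Expected L1 norm of the two latents under m-absorption."""
    e1, e2 = f1 - m * f2, f2
    loss = 0.0
    for prob, x in [(p12, f1 + f2), (p1, f1)]:   # both features / parent only
        h1, h2 = max(e1 @ x, 0.0), max(e2 @ x, 0.0)
        loss += prob * (abs(h1) + abs(h2))
    return loss

for m in [0.0, 0.5, 1.0]:
    assert np.isclose(sparsity_loss(m), p1 + (2 - m) * p12)
```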

It is evident that absorption leads to higher sparsity in the encoder activations for hierarchical features. Now, we show that minimizing the sparsity loss (by differentiating it with respect to the absorption amount $m$) naturally leads to increasing absorption.

Differentiating the Sparsity Loss

Assume, as above, that the sparsity penalty is given by an $L_1$ norm on the encoder activations of the two latents:

$\mathcal{L}_{sparsity}(m) = p_1 + (2 - m)\,p_{12}$

We can see that the derivative of the sparsity loss with respect to $m$ promotes absorption:

$\frac{\partial \mathcal{L}_{sparsity}}{\partial m} = -p_{12}$

This derivative is always negative as long as $p_{12} > 0$, i.e., datapoints with both features exist in our distribution, so increasing sparsity (decreasing the sparsity loss) always increases absorption.
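A toy illustration of this dynamic (our own sketch; in a real SAE the absorption amount $m$ is implicit in the weights, and the learning rate and clipping here are simplifications):

```python
# Toy illustration: following the sparsity gradient drives m toward full absorption.
p12, lr, m = 0.2, 0.05, 0.0        # illustrative probability, step size, initial m
for _ in range(200):
    grad = -p12                    # d/dm [p1 + (2 - m) * p12] = -p12 < 0
    m = min(m - lr * grad, 1.0)    # gradient step, clipped at full absorption
print(m)                           # -> 1.0
```

Next, we show that absorption makes SAEs (and other related feature extraction methods) much less trustworthy for safety.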

Absorption makes SAEs less trustworthy for safety

The hope with SAEs (and other related research) is that we extract model-internal latents representing human-interpretable features that we can detect and control. For this to be usable, we want this to work successfully for complex, harmful behaviours such as lying, strategic deception, power-seeking, sycophancy, backdoors, etc.

Absorption, and sparsity in feature extraction which promotes absorption, means that even if we find linear latents for these features (for which we don’t really have good progress in the first place [TODO: cite]), we can’t trust them.

A feature for deception, with absorption, might actually have learned deception-except-deception-in-2027, and a feature for power-seeking might have learned power-seeking-except-instrumental-convergence. Unless we fix our feature extractors so they do not optimize purely for sparsity in latent activations (some other ideas and starting points include KL-divergence and parameter-space related directions), we can’t trust our SAEs, and their features can potentially be more misleading and harmful than helpful.

An example of feature absorption in real-world models

In the original feature absorption paper, absorption is shown to occur in real SAEs trained on LLMs. SAE latents that seem to track starting-letter information (e.g. a latent that fires on tokens that start with S) fail to fire on seemingly arbitrary tokens that start with S (like the token "_short"). The paper shows that this is due to more specific latents “absorbing” the “starts with S” direction.

In feature absorption, we find gerrymandered SAE latents which appear to track an interpretable concept but have holes in their recall. Here, we see the dashboard for Gemma Scope layer 3 16k, latent 6510 which appears to track “starts with S”, but mysteriously doesn’t fire on “_short”.

The original paper hypothesizes that feature absorption is a logical consequence of the sparsity penalty in SAE training, but we now have a proof that naively optimizing an SAE for sparsity will indeed lead to feature absorption.

How much of this applies to other SAE variants and feature-extraction methods?

We should expect feature absorption in any SAE architecture that incentivizes sparsity, which at present includes all common architectures. JumpReLU, TopK, and standard L1 SAEs all extract feature representations by making use of sparsity, and we should thus expect them all to exhibit feature absorption.

Cross-layer SAE variants such as crosscoders[1] and transcoders[2] also rely on sparsity to extract features, and we should thus expect that these new architectures will also suffer from feature absorption. Indeed, recent work on transcoders finds they do experience feature absorption[3].

Matryoshka SAEs to fix Absorption

Matryoshka SAEs are a promising approach to fixing absorption in SAEs. These SAEs encode a notion of hierarchy by forcing earlier latents to reconstruct the full output on their own, making it more difficult for parent latents to have holes in their recall for child features.
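A rough sketch of the idea (our own illustration, not the exact published formulation; the prefix sizes, $\lambda$, and function name are placeholders): the reconstruction loss is applied to nested prefixes of the latent vector, so earlier latents must reconstruct the input on their own.

```python
import numpy as np

def matryoshka_loss(x, W_enc, W_dec, prefix_sizes=(16, 64, 256), lam=1e-2):
    """Sketch of a Matryoshka-style objective: every prefix of the latent vector
    must reconstruct x on its own, so early (parent-level) latents cannot
    offload their direction onto later, more specific latents."""
    h = np.maximum(W_enc @ x, 0.0)              # shared encoder, no bias
    loss = lam * np.sum(np.abs(h))              # usual L1 sparsity penalty
    for k in prefix_sizes:                      # nested reconstruction losses
        x_hat_k = W_dec[:, :k] @ h[:k]          # decode using only the first k latents
        loss += np.sum((x - x_hat_k) ** 2)
    return loss
```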

  1. ^

    Sparse Crosscoders for Cross-Layer Features and Model Diffing [link], Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, Christopher Olah, Transformer Circuits Thread 2024

  2. ^

    Transcoders Find Interpretable LLM Feature Circuits [link], Jacob Dunefsky, Philippe Chlenski, Neel Nanda, arXiv:2406.11944, 2024

  3. ^

    Transcoders Beat Sparse Autoencoders for Interpretability [link], Gonçalo Paulo, Stepan Shabalin, Nora Belrose, arXiv:2501.18823, 2025
