From Unruly Stacks to Organized Shelves: Toy Model Validation of Structured Priors in Sparse Autoencoders
by Yuxiao Li, Henry Zheng, Zachary Baker, Eslam Zaher, Maxim Panteleev, Maxim Finenko
June 2025 | SPAR Spring ’25
A post in our series “Feature Geometry & Structured Priors in Sparse Autoencoders”
TL;DR: We frame sparse autoencoding as a latent-variable model (LVM) and inject simple correlated priors to help untangle latent features in a toy-model setting. On synthetic benchmarks in the style of “Toy Models of Superposition”, a global-correlation prior (ρ = 0.8) yields far cleaner feature recovery than the isotropic VAE baseline, validating our variational framework.
About this series
This post kicks off a multi-part exploration of feature geometry in large-model embeddings and how to bake that geometry into priors for sparse autoencoders (SAEs). We’ve been working since February through the SPAR 2025 program with a fantastic group of mentees, bringing together tools from probability, geometry, and mechanistic interpretability.
Our approach rests on two core intuitions:
Variational framing of SAEs. By casting SAEs as latent-variable models, we can replace ad-hoc L1/TopK penalties with ELBO losses under structured priors p(z), giving a clear probabilistic handle on feature disentanglement.
Feature-space geometry. Real model activations exhibit rich geometric structures; we aim to discover and then encode these structures—via block-diagonal and graph-Laplacian covariances—directly into our priors.
Together, these ideas form a systematic framework for building and evaluating SAEs with inductive biases matched to the true geometry of model features.
Series Table of Contents
➡️ Part I (you are here): Toy model comparison of isotropic vs global-correlation priors in V-SAE
Part II: Block-diagonal & graph-Laplacian structures in LM embeddings
Part III: Crosscoders & Ladder SAEs (multi-layer, multi-resolution coding)
Part IV: Revisiting the Linear Representation Hypothesis (LRH) via geometric probes
0. Background Story
Imagine stepping into an enormous library where every book—on wolves, quantum mechanics, or Greek mythology—has been crammed onto a few tiny shelves. You know the books are there, but they’re superimposed in a chaotic mass, impossible to disentangle. This is exactly the challenge of polysemanticity and superposition in neural activations: too many “features” packed into too few dimensions. Mechanistic interpretability is our librarian’s toolkit for reverse-engineering that mess—cracking open the model, identifying its hidden “gears”, and figuring out how each concept is represented. In this first toy experiment, we cast sparse autoencoders (SAEs) as a latent-variable cataloguer, injecting a simple global-correlation prior so that related books instinctively group onto the same shelf. The result? A beautifully organized, block-structured dictionary of features—proof that with the right probabilistic prior, even the messiest shelf can become a masterfully curated collection.
I. Introduction: Why SAEs & Monosemantic Features?
A growing body of work has begun to chart the geometry of learned representations—and to use that map to guide feature learning:
Toy Models of Superposition (2022) introduced simple synthetic benchmarks illustrating the superposition phenomenon, in which having more features (K) than dimensions (N) forces “stacking” of concepts and leads to polysemantic directions, establishing the need for specialized dictionary-learning methods.
Towards Monosemanticity: Decomposing LMs with Dictionary Learning (2023) first applied SAEs to LLM residual-stream activations, demonstrating that a learned overcomplete basis can uncover monosemantic feature vectors far more interpretable than individual neurons. It also identified feature splitting and entanglement as failure modes.
Classic SAEs learn a dictionary $W_d$ to reconstruct activations $x \in \mathbb{R}^d$ under a hard-sparsity constraint
$$\min_{W_d,\ f}\ \|x - W_d f(x)\|_2^2 \quad \text{s.t.} \quad \|f(x)\|_0 \le K.$$
These monosemantic features often align with human-interpretable concepts. Yet when true factors are correlated, SAEs fragment (“feature splitting”) or absorb one factor into another, giving polysemantic atoms that obscure mechanistic insight.
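As a point of reference for the baseline compared against in Section III, here is a minimal PyTorch sketch of a classic SAE trained with the common $\ell_1$ relaxation of the hard $\ell_0$ constraint. The class name, layer sizes, and $\ell_1$ coefficient are our own illustrative choices, not the exact settings of the original experiments.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Classic SAE baseline: linear encoder, ReLU codes, linear dictionary decoder."""
    def __init__(self, d=32, k=32):
        super().__init__()
        self.encoder = nn.Linear(d, k)
        self.decoder = nn.Linear(k, d, bias=False)  # columns play the role of the dictionary W_d

    def forward(self, x):
        f = torch.relu(self.encoder(x))             # sparse codes f(x)
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(model, x, l1_coeff=1e-3):
    x_hat, f = model(x)
    recon = ((x - x_hat) ** 2).sum(-1).mean()       # reconstruction MSE
    sparsity = f.abs().sum(-1).mean()               # l1 penalty standing in for the l0 constraint
    return recon + l1_coeff * sparsity
```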
II. Variational SAE (V-SAE): Model & Assumptions
2.1 Latent-Variable Formulation
We treat the encoder-decoder as a probabilistic model:
$$p(x, z) = \mathcal{N}(x;\ W_d z,\ \sigma_x^2 I)\cdot p(z)$$
where $z$ is the latent feature vector, $W_d \in \mathbb{R}^{d \times k}$ is a learned decoder matrix, and $\sigma_x^2$ is fixed. The first factor plays the role of the decoder (likelihood) and the second is the prior.
We introduce a Gaussian encoder
$$q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x),\ \mathrm{diag}(\sigma_\phi^2(x))\big)$$
where the encoder network outputs the mean and the diagonal log-variance of the posterior.
2.2 ELBO Objective
Inspired by variational inference methods, we propose Variational SAEs (V-SAEs). Specifically, we derive the training objective from the Evidence Lower Bound (ELBO) on $\log p_\theta(x)$, which we write as the loss
$$\min_{W_d,\ \phi}\ \mathbb{E}_{q_\phi(z \mid x)}\big[\|x - W_d z\|^2\big] + \alpha\,\mathrm{KL}\big(q_\phi(z \mid x)\ \|\ p(z)\big).$$
The hyperparameter $\alpha$ controls the trade-off between reconstruction accuracy and the strength of the prior.
2.3 Structured Priors Design
In this toy-model study, we compare two Gaussian priors $p(z) = \mathcal{N}(0, \Sigma_p)$:
Isotropic prior: $\Sigma_p = \sigma_p^2 I_k$
Full-covariance prior: $\Sigma_p$ is a free, positive-definite matrix learned (via Cholesky parameterization) alongside $W_d$.
These two extremes let us isolate the effect of allowing arbitrary latent correlations (full) versus assuming none (iso).
In summary:
Generative model:
$$p(x, z) = \mathcal{N}(x;\ W_d z,\ \sigma_x^2 I)\cdot\mathcal{N}(z;\ 0,\ \Sigma_p)$$
Variational posterior:
$$q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x),\ \mathrm{diag}(\sigma_\phi^2(x))\big)$$
Training loss (negative ELBO):
$$\mathcal{L}(x) = \mathbb{E}_{q_\phi}\big[\|x - W_d z\|^2\big] + \alpha\,\mathrm{KL}\big(q_\phi(z \mid x)\ \|\ \mathcal{N}(0, \Sigma_p)\big).$$
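For a diagonal-Gaussian posterior and a full-covariance Gaussian prior, the KL term has the standard closed form
$$\mathrm{KL}\big(\mathcal{N}(\mu,\ \mathrm{diag}(\sigma^2))\ \|\ \mathcal{N}(0, \Sigma_p)\big) = \tfrac{1}{2}\Big(\mathrm{tr}\big(\Sigma_p^{-1}\mathrm{diag}(\sigma^2)\big) + \mu^\top \Sigma_p^{-1}\mu - k + \log\det\Sigma_p - \sum_{i=1}^{k}\log\sigma_i^2\Big).$$
The PyTorch sketch below puts the pieces together. It is our own illustrative reconstruction, not the exact training code: class and method names such as `VSAE` and `prior_chol` are ours, and the layer sizes follow the toy configuration in Section 3.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VSAE(nn.Module):
    def __init__(self, d_in=32, k=32, hidden=64):
        super().__init__()
        # 2-layer ReLU encoder producing mean and log-variance of q_phi(z|x)
        self.encoder = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * k))
        self.W_d = nn.Linear(k, d_in, bias=False)    # linear decoder W_d
        self.L_raw = nn.Parameter(torch.eye(k))      # free Cholesky factor of Sigma_p

    def prior_chol(self):
        # Lower-triangular factor with a positive diagonal, so Sigma_p = L L^T is positive-definite.
        L = torch.tril(self.L_raw)
        d = torch.diagonal(L)
        return L - torch.diag(d) + torch.diag(F.softplus(d) + 1e-4)

    def forward(self, x, alpha=1.0):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        recon = ((x - self.W_d(z)) ** 2).sum(-1).mean()

        # Closed-form KL( N(mu, diag(sigma^2)) || N(0, Sigma_p) )
        L = self.prior_chol()
        k = mu.shape[-1]
        Sigma_p_inv = torch.cholesky_inverse(L)
        trace = (torch.diagonal(Sigma_p_inv) * logvar.exp()).sum(-1)
        quad = torch.einsum('bi,ij,bj->b', mu, Sigma_p_inv, mu)
        logdet_p = 2.0 * torch.log(torch.diagonal(L)).sum()
        kl = 0.5 * (trace + quad - k + logdet_p - logvar.sum(-1)).mean()
        return recon + alpha * kl
```

The isotropic variant corresponds to freezing the Cholesky factor at $\sigma_p I$ instead of learning it.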
III. Toy Model Experiments
We compare two bottleneck models on synthetic superposition benchmarks:
SAE (Baseline): Traditional sparse autoencoder with an ℓ1 penalty on the latent activations f(x).
VAE (V-SAE Proposal): A variational autoencoder with a structured Gaussian prior $p(z) = \mathcal{N}(0, \Sigma_p)$.
3.1 Case Categories
We generate three families of toy datasets, each designed to stress different aspects of correlated superposition:
Basic Cases (Case 1–2):
Case 1: Two orthogonal latent directions of unequal variance (variance ratio 1 : 0.2).
Case 2: Two latent directions at 45° (equal variance), testing angular disentanglement.
Setwise Correlation / Anti-Correlation (Case 3–5):
Case 3: A set of $n_{\text{corr}} = 3$ features all correlated at 0.8.
Case 4: A set of $n_{\text{anticorr}} = 3$ features all anti-correlated at –0.8.
Case 5: Mixed correlated and anti-correlated groups.
Full Correlation Matrix (Case 6):
Draw a random k×k positive-definite covariance matrix with specified off-diagonal structure, then sample latents accordingly.
In each case we sample $N = 10{,}000$ points $z \sim \mathcal{N}(0, \Sigma_{\text{true}})$ and project to $x = W_{\text{true}}\,z$ with additive Gaussian noise.
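For concreteness, here is a hypothetical NumPy generator for a Case-3-style dataset (a block of $n_{\text{corr}} = 3$ features correlated at $\rho = 0.8$ inside an otherwise independent latent set). The function name, latent/observation dimensions, and noise level are assumptions for the sketch, not the exact values used in our runs.

```python
import numpy as np

def make_case3_data(n=10_000, k_true=6, d=32, rho=0.8, n_corr=3, noise_std=0.05, seed=0):
    rng = np.random.default_rng(seed)
    Sigma_true = np.eye(k_true)
    Sigma_true[:n_corr, :n_corr] = rho                                   # correlated block
    np.fill_diagonal(Sigma_true, 1.0)                                    # unit variances
    z = rng.multivariate_normal(np.zeros(k_true), Sigma_true, size=n)    # true latents
    W_true = rng.standard_normal((d, k_true)) / np.sqrt(k_true)          # ground-truth projection
    x = z @ W_true.T + noise_std * rng.standard_normal((n, d))           # noisy observations
    return x, z, W_true, Sigma_true
```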
3.2 Training Configuration
Encoder/Decoder: 2-layer ReLU MLP (32 → 64 → 2×32 dims for mean and log-variance) → diagonal-Gaussian posterior → linear decoder
Objective: ELBO with $\alpha = 1.0$ (reconstruction MSE + KL term)
Optimizer: Adam with learning rate $10^{-3}$, batch size 256, 200 epochs
We compare exactly two priors $p(z) = \mathcal{N}(0, \Sigma_p)$:
Iso: $\Sigma_p = \sigma_p^2 I_{32}$
Full: $\Sigma_p$ a free Cholesky-parameterized covariance
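Putting the configuration together, a training loop might look like the sketch below (our own code, reusing the `VSAE` class and `make_case3_data` generator sketched above).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_vsae(model, x, alpha=1.0, epochs=200, batch_size=256, lr=1e-3):
    data = TensorDataset(torch.as_tensor(x, dtype=torch.float32))
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for (xb,) in loader:
            loss = model(xb, alpha=alpha)   # negative ELBO: MSE + alpha * KL
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# e.g.: x, z, W_true, Sigma_true = make_case3_data(); model = train_vsae(VSAE(), x)
```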
3.3 Evaluation Metrics
We report the following statistics on a held-out test set:
Reconstruction MSE: $\frac{1}{N}\sum_{n=1}^{N}\|x_n - \hat{x}_n\|_2^2$.
Latent Sparsity: The average fraction of non-zero entries in the bottleneck activations $f(x)$ (or in the posterior mean for the VAE).
KL Term (VAE only): The final average $\mathrm{KL}\big(q(z \mid x)\ \|\ p(z)\big)$, to verify the prior is being enforced.
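A held-out evaluation consistent with these definitions might look like the following sketch (the activity threshold and the choice to reconstruct from the posterior mean are our assumptions; the KL term is read off the training objective).

```python
import torch

@torch.no_grad()
def evaluate(model, x_test, eps=1e-3):
    x_test = torch.as_tensor(x_test, dtype=torch.float32)
    mu, logvar = model.encoder(x_test).chunk(2, dim=-1)
    recon = model.W_d(mu)                               # reconstruct from the posterior mean
    mse = ((x_test - recon) ** 2).sum(-1).mean().item()
    sparsity = (mu.abs() > eps).float().mean().item()   # fraction of "active" latent entries
    return mse, sparsity
```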
3.4 Results: Block-Correlated Setting
Basic Cases

| Case | SAE MSE | VAE MSE | SAE Sparsity | VAE Sparsity |
| --- | --- | --- | --- | --- |
| 1 | 0.015 | 0.010 | 8 / 32 | 12 / 32 |
| 2 | 0.020 | 0.014 | 8 / 32 | 12 / 32 |
Setwise Correlation / Anti-Correlation

| Case | SAE MSE | VAE MSE | SAE Sparsity | VAE Sparsity |
| --- | --- | --- | --- | --- |
| 3 | 0.025 | 0.018 | 10 / 32 | 14 / 32 |
| 4 | 0.030 | 0.022 | 10 / 32 | 13 / 32 |
| 5 | 0.028 | 0.020 | 11 / 32 | 14 / 32 |
Full Correlation Matrix (Case 6)

| Model | MSE | Sparsity | KL Term |
| --- | --- | --- | --- |
| SAE | 0.032 | 12 / 32 | — |
| VAE | 0.025 | 15 / 32 | 6.4 |
In every scenario, the VAE variant achieves lower MSE—often by 20–30 %—while maintaining comparable or higher sparsity. The learned KL term remains moderate, confirming the covariance prior is active but not over-regularizing.
3.5 Qualitative Reconstructions
Below are sample reconstructions for each covariance pattern under both priors. In every plot, the top two rows show ground-truth vectors, both conceptual and sampled from data; the middle row shows vanilla SAE reconstructions; the bottom row shows reconstructions under the isotropic Gaussian prior.
Independent Features with Varying Sparsity
Features with Anti-correlated Pairs
Features with Correlated and Anti-correlated Pairs
Features with Correlation Matrix
Summary
These toy model experiments demonstrate that structured variational inference can recover true latent features where classic SAE objectives struggle. In this minimal setting we have:
Replicated SAE-like behavior with an isotropic $\Sigma_p$ and a large $\alpha$.
Shown that learning $\Sigma_p$ (the full-covariance prior) outperforms the unstructured isotropic prior.
Constructed a modular framework into which we can plug richer priors—block-diagonal, graph-Laplacian, energy-based—to further improve disentanglement.
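As a preview of the kind of richer prior we have in mind for Part II, the snippet below constructs a fixed block-diagonal $\Sigma_p$ with within-block correlation $\rho$; the block sizes and $\rho$ are illustrative assumptions, and such a matrix could simply replace the learned Cholesky factor in the V-SAE sketch above.

```python
import torch

def block_diagonal_prior(block_sizes, rho=0.8):
    """Fixed block-diagonal covariance: unit variances, correlation rho within each block."""
    blocks = []
    for b in block_sizes:
        block = torch.full((b, b), rho)
        block.fill_diagonal_(1.0)
        blocks.append(block)
    return torch.block_diag(*blocks)

# e.g. four groups of 8 latents each give a 32x32 prior covariance
Sigma_p = block_diagonal_prior([8, 8, 8, 8])
```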
These findings motivate our next steps: replacing the fully learned covariance with semantically structured priors (block-diagonal, graph-Laplacian) in Part II, and ultimately integrating these ideas into real-LM benchmarks. Stay tuned for the rest of the series!