jacob_drori
SAE on activation differences
The approach you suggest feels similar in spirit to Farnik et al., and I think it's a reasonable thing to try. However, I opted for my approach since it produces an exactly sparse forward pass, rather than just suppressing the contribution of weak connections. So no arbitrary threshold must be chosen when building a feature circuit/attribution graph. Either two latents are connected, or they are not.
I also like the fact that my approach gives us sparse global virtual weights, which allows us to study global circuits—something Anthropic had problems with due to interference weights in their approach.
Sparsely-connected Cross-layer Transcoders
Vary temperature t and measure the resulting learning coefficient function
This confuses me. IIUC, the tempered posterior at temperature $t$ is $p_t(w) \propto \varphi(w)\, e^{-n L_n(w)/t}$. So changing temperature is equivalent to rescaling the loss by a constant. But such a rescaling doesn’t affect the LLC.
What did I misunderstand?
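To spell out the equivalence I have in mind (my notation, and glossing over empirical-vs-population-loss details): writing the tempered posterior as above,
$$
e^{-\frac{n}{t} L_n(w)} \;=\; e^{-n \tilde L_n(w)}, \qquad \tilde L_n := \tfrac{1}{t} L_n ,
$$
and the sublevel-set volumes satisfy $\tilde V(\epsilon) = V(t\epsilon)$, so $\lambda = \lim_{\epsilon \to 0} \log V(\epsilon)/\log \epsilon$ is the same for $\tilde L_n$ as for $L_n$.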
Let $V(\epsilon)$ be the volume of a behavioral region at cutoff $\epsilon$. Your behavioral LLC at finite noise scale $\epsilon$ is $\frac{d \log V(\epsilon)}{d \log \epsilon}$, which is invariant under rescaling $V$ by a constant. This information about the overall scale of $V$ seems important. What’s the reason for throwing it out in SLT?
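Spelling out the invariance (using the log-log-derivative form above, which is my reading of the definition): for any constant $c > 0$,
$$
\frac{d \log\!\big(c\,V(\epsilon)\big)}{d \log \epsilon} \;=\; \frac{d\big(\log c + \log V(\epsilon)\big)}{d \log \epsilon} \;=\; \frac{d \log V(\epsilon)}{d \log \epsilon},
$$
so the prefactor of $V$ drops out entirely.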
Fantastic research! Any chance you’ll open-source weights of the insecure qwen model? This would be useful for interp folks.
The Jacobians are much more sparse in pre-trained LLMs than in re-initialized transformers.
This would be very cool if true, but I think further experiments are needed to support it.
Imagine a dumb scenario where during training, all that happens to the MLP is that it “gets smaller”, so that MLP_trained(x) = c * MLP_init(x) for some small c. Then all the elements of the Jacobian also get smaller by a factor of c, and your current analysis—checking the number of elements above a threshold—would conclude that the Jacobian had gotten sparser. This feels wrong: merely rescaling a function shouldn’t affect the sparsity of the computation it implements.
To avoid this issue, you could report a scale-invariant quantity like the kurtosis of the Jacobian’s elements (their fourth central moment divided by their variance squared), or the ratio of their L1 and L2 norms, or plenty of other options. But these quantities still aren’t perfect, since they aren’t invariant under linear transformations of the model’s activations:
E.g. suppose an mlp_out feature F depends linearly on some mlp_in feature G, which is roughly orthogonal to F. If we stretch all model activations along the F direction and retrain our SAEs, then the new mlp_out SAE will contain (in an ideal world) a feature F’ which is the same as F but with activations larger by some factor. On the other hand, the mlp_in SAE should contain a feature G’ which is roughly the same as G. Hence the (F, G) element of the Jacobian has been made bigger simply by applying a linear transformation to the model’s activations. Generally this will affect our sparsity measure, which feels wrong: merely applying a linear map to all model activations shouldn’t change the sparsity of the computation being done on those activations. In other words, our sparsity measure shouldn’t depend on a choice of basis for the residual stream.
I’ll try to think of a principled measure of the sparsity of the Jacobian. In the meantime, I think it would still be interesting to see a scale-invariant quantity reported, as suggested above.
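Here’s a minimal sketch of the kind of check I have in mind (the synthetic Jacobian and all names are mine, purely illustrative): a global rescaling changes the fraction of elements above a threshold, but leaves the kurtosis and the L1/L2 ratio untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsity_stats(J, threshold=1e-2):
    """Threshold count (scale-dependent) vs. two scale-invariant measures."""
    x = J.ravel()
    frac_above = np.mean(np.abs(x) > threshold)              # changes under J -> c*J
    kurtosis = np.mean((x - x.mean()) ** 4) / x.var() ** 2   # invariant under J -> c*J
    l1_over_l2 = np.abs(x).sum() / np.linalg.norm(x)         # invariant; in [1, sqrt(x.size)]
    return frac_above, kurtosis, l1_over_l2

# Synthetic "sparse-ish" Jacobian: mostly tiny entries plus a few large ones.
J = 0.001 * rng.normal(size=(512, 512))
mask = rng.random(J.shape) < 0.01
J[mask] += rng.normal(size=mask.sum())

for c in [1.0, 0.1]:  # c = 0.1 mimics the "MLP just gets smaller" scenario above
    print(f"c = {c}: {sparsity_stats(c * J)}")
```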
We have pretty robust measurements of complexity of algorithms from SLT
This seems overstated. What’s the best evidence so far that the LLC positively correlates with the complexity of the algorithm implemented by a model? In fact, do we even have any models whose circuitry we understand well enough to assign them a “complexity”?
… and it seems like similar methods can lead to pretty good ways of separating parallel circuits (Apollo also has some interesting work here that I think constitutes real progress)
Citation?
Same difference
I’d prefer “basis we just so happen to be measuring in”. Or “measurement basis” for short.
You could use “pointer variable”, but this would commit you to writing several more paragraphs to unpack what it means (which I encourage you to do, maybe in a later post).
Your use of “pure state” is totally different to the standard definition (namely rank(rho)=1). I suggest using a different term.
The QM state space has a preferred inner product, which we can use to e.g. dualize a (0,2) tensor (i.e. a thing that eats two vectors and gives a number) into a (1,1) tensor (i.e. an operator). So we can think of it either way.
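Concretely, in finite dimensions (my notation): the inner product $\langle\cdot,\cdot\rangle$ lets us trade a bilinear form $B$ for an operator $A$ via
$$
B(u, v) \;=\; \langle u,\, A v \rangle \quad \text{for all vectors } u, v,
$$
and this correspondence is one-to-one, so the two descriptions carry the same information.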
Oops, good spot! I meant to write 1 minus that quantity. I’ve edited the OP.
This seems very interesting, but I think your post could do with a lot more detail. How were the correlations computed? How strongly do they support PRH? How was the OOD data generated? I’m sure the answers could be pieced together from the notebook, but most people won’t click through and read the code.
Ah, I think I understand. Let me write it out to double-check, and in case it helps others.
Say , for simplicity. Then . This sum has nonzero terms.
In your construction, . Focussing on a single neuron, labelled by , we have . This sum has nonzero terms.
So the preactivation of an MLP hidden neuron in the big network is . This sum has nonzero terms. We only “want” the terms where ; the rest (i.e. the majority) are noise. Each noise term in the sum is a random vector, so the different noise terms are roughly orthogonal, and so the norm of the noise is (times some other factors, but this captures the -dependence, which is what I was confused about).
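A quick numerical sanity-check of the “roughly orthogonal noise terms” step (dimensions and counts here are arbitrary, just for illustration): the norm of a sum of $k$ independent random unit vectors grows like $\sqrt{k}$ rather than $k$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # ambient dimension; arbitrary, just needs to be large

for k in [10, 100, 1000]:
    vs = rng.normal(size=(k, d))
    vs /= np.linalg.norm(vs, axis=1, keepdims=True)   # k random unit vectors, nearly orthogonal
    print(k, float(np.linalg.norm(vs.sum(axis=0))))   # grows like sqrt(k), not k
```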
I’m confused by the read-in bound:
Sure, each neuron reads from of the random subspaces. But in all but of those subspaces, the big network’s activations are smaller than , right? So I was expecting a tighter bound—something like:
It may be possible to massively reduce memory usage in sparsely-connected mode.
Let B be batch size, K be num active latents per dictionary per token, and F be num latents per dictionary.
My current implementation of sparsely-connected mode has a terrible O(F²) memory usage, since each virtual weight matrix has F² elements. But how many of these virtual weights do we actually need to compute?
Upstream latents: On each token in the batch, we only need the virtual weights connecting to the K active upstream latents.
Downstream latents: Strictly speaking, we should compute activations for every downstream latent, since we don’t know in advance which will be active. But, insofar as vanilla mode closely approximates sparsely-connected mode, we should be okay to only compute virtual weights connecting to downstream latents that were active in vanilla mode.
So on each token, we only need to compute K² virtual weights, and so the memory requirement is BK², which is small.
Of course, this new approach loses something: sparsely-connected mode now relies on vanilla mode to tell it which latents should activate. So much for a standalone replacement model! I think a reasonable middle-ground is to only compute virtual weights to the 100×K (say) latents with largest vanilla preactivation. Then compute sparsely-connected preactivations for all those latents, and apply TopK to get the activations. The memory usage is then 100BK², which is still small.
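A rough sketch of the bookkeeping I have in mind (all shapes and names are mine, and this is only the virtual-weight gather, not a full training step): per token, gather the K active upstream latents and the 100·K candidate downstream latents, and form only that B × (100K) × K slice of virtual weights.

```python
import torch

B, K, F, d = 8, 32, 24576, 768      # batch, active latents, dict size, model dim (illustrative)
C = 100 * K                         # candidate downstream latents per token

W_dec_up = torch.randn(F, d)        # upstream decoder directions
W_enc_down = torch.randn(d, F)      # downstream encoder directions

up_idx = torch.randint(F, (B, K))   # active upstream latents per token (from vanilla mode)
up_act = torch.rand(B, K)           # their activations
down_idx = torch.randint(F, (B, C)) # candidates: top-100K vanilla preactivations downstream

# Virtual weights only for (candidate downstream) x (active upstream) pairs:
#   V[b, c, k] = (downstream encoder dir of down_idx[b, c]) . (upstream decoder dir of up_idx[b, k])
dec = W_dec_up[up_idx]                        # (B, K, d)
enc = W_enc_down.T[down_idx]                  # (B, C, d)
V = torch.einsum("bcd,bkd->bck", enc, dec)    # (B, C, K): 100*B*K^2 entries, not F^2

# Sparsely-connected preactivations for the candidates, then TopK as usual:
pre = torch.einsum("bck,bk->bc", V, up_act)   # (B, C)
topk = pre.topk(K, dim=-1)
```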