jacob_drori
SAE on activation differences
The approach you suggest feels similar in spirit to Farnik et al., and I think it's a reasonable thing to try. However, I opted for my approach since it produces an exactly sparse forward pass, rather than just suppressing the contribution of weak connections. So no arbitrary threshold must be chosen when building a feature circuit/attribution graph. Either two latents are connected, or they are not.
I also like the fact that my approach gives us sparse global virtual weights, which allows us to study global circuits—something Anthropic had problems with due to interference weights in their approach.
Sparsely-connected Cross-layer Transcoders
Vary temperature t and measure the resulting learning coefficient function
This confuses me. IIUC, the tempered posterior at temperature $t$ is $p_t(w) \propto \varphi(w)\, e^{-n L_n(w)/t}$. So changing temperature is equivalent to rescaling the loss by a constant. But such a rescaling doesn’t affect the LLC.
What did I misunderstand?
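To spell out the equivalence I have in mind (my notation, and glossing over empirical-vs-population-loss details): writing the tempered posterior as above,
$$
e^{-\frac{n}{t} L_n(w)} \;=\; e^{-n \tilde L_n(w)}, \qquad \tilde L_n := \tfrac{1}{t} L_n ,
$$
and the sublevel-set volumes satisfy $\tilde V(\epsilon) = V(t\epsilon)$, so $\lambda = \lim_{\epsilon \to 0} \log V(\epsilon)/\log \epsilon$ is the same for $\tilde L_n$ as for $L_n$.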
Let $V(\epsilon)$ be the volume of a behavioral region at cutoff $\epsilon$. Your behavioral LLC at finite noise scale $\epsilon$ is $\frac{d \log V(\epsilon)}{d \log \epsilon}$, which is invariant under rescaling $V$ by a constant. This information about the overall scale of $V$ seems important. What’s the reason for throwing it out in SLT?
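Spelling out the invariance (using the log-log-derivative form above, which is my reading of the definition): for any constant $c > 0$,
$$
\frac{d \log\!\big(c\,V(\epsilon)\big)}{d \log \epsilon} \;=\; \frac{d\big(\log c + \log V(\epsilon)\big)}{d \log \epsilon} \;=\; \frac{d \log V(\epsilon)}{d \log \epsilon},
$$
so the prefactor of $V$ drops out entirely.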
Fantastic research! Any chance you’ll open-source weights of the insecure qwen model? This would be useful for interp folks.
The Jacobians are much more sparse in pre-trained LLMs than in re-initialized transformers.
This would be very cool if true, but I think further experiments are needed to support it.
Imagine a dumb scenario where during training, all that happens to the MLP is that it “gets smaller”, so that MLP_trained(x) = c * MLP_init(x) for some small c. Then all the elements of the Jacobian also get smaller by a factor of c, and your current analysis—checking the number of elements above a threshold—would conclude that the Jacobian had gotten sparser. This feels wrong: merely rescaling a function shouldn’t affect the sparsity of the computation it implements.
To avoid this issue, you could report a scale-invariant quantity like the kurtosis of the Jacobian’s elements (their fourth central moment divided by their variance squared), or the ratio of their L1 and L2 norms, or plenty of other options. But these quantities still aren’t perfect, since they aren’t invariant under linear transformations of the model’s activations:
E.g. suppose an mlp_out feature F depends linearly on some mlp_in feature G, which is roughly orthogonal to F. If we stretch all model activations along the F direction and retrain our SAEs, then the new mlp_out SAE will contain (in an ideal world) a feature F’ which is the same as F but with activations larger by some factor. On the other hand, the mlp_in SAE should contain a feature G’ which is roughly the same as G. Hence the (F, G) element of the Jacobian has been made bigger simply by applying a linear transformation to the model’s activations. Generally this will affect our sparsity measure, which feels wrong: merely applying a linear map to all model activations shouldn’t change the sparsity of the computation being done on those activations. In other words, our sparsity measure shouldn’t depend on a choice of basis for the residual stream.
I’ll try to think of a principled measure of the sparsity of the Jacobian. In the meantime, I think it would still be interesting to see a scale-invariant quantity reported, as suggested above.
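Here’s a minimal sketch of the kind of check I have in mind (the synthetic Jacobian and all names are mine, purely illustrative): a global rescaling changes the fraction of elements above a threshold, but leaves the kurtosis and the L1/L2 ratio untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsity_stats(J, threshold=1e-2):
    """Threshold count (scale-dependent) vs. two scale-invariant measures."""
    x = J.ravel()
    frac_above = np.mean(np.abs(x) > threshold)              # changes under J -> c*J
    kurtosis = np.mean((x - x.mean()) ** 4) / x.var() ** 2   # invariant under J -> c*J
    l1_over_l2 = np.abs(x).sum() / np.linalg.norm(x)         # invariant; in [1, sqrt(x.size)]
    return frac_above, kurtosis, l1_over_l2

# Synthetic "sparse-ish" Jacobian: mostly tiny entries plus a few large ones.
J = 0.001 * rng.normal(size=(512, 512))
mask = rng.random(J.shape) < 0.01
J[mask] += rng.normal(size=mask.sum())

for c in [1.0, 0.1]:  # c = 0.1 mimics the "MLP just gets smaller" scenario above
    print(f"c = {c}: {sparsity_stats(c * J)}")
```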
We have pretty robust measurements of complexity of algorithms from SLT
This seems overstated. What’s the best evidence so far that the LLC positively correlates with the complexity of the algorithm implemented by a model? In fact, do we even have any models whose circuitry we understand well enough to assign them a “complexity”?
… and it seems like similar methods can lead to pretty good ways of separating parallel circuits (Apollo also has some interesting work here that I think constitutes real progress)
Citation?
Same difference
I’d prefer “basis we just so happen to be measuring in”. Or “measurement basis” for short.
You could use “pointer variable”, but this would commit you to writing several more paragraphs to unpack what it means (which I encourage you to do, maybe in a later post).
Your use of “pure state” is totally different to the standard definition (namely rank(rho)=1). I suggest using a different term.
The QM state space has a preferred inner product, which we can use to e.g. dualize a (0,2) tensor (i.e. a thing that eats two vectors and gives a number) into a (1,1) tensor (i.e. an operator). So we can think of it either way.
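Concretely, in finite dimensions (my notation): the inner product $\langle\cdot,\cdot\rangle$ lets us trade a bilinear form $B$ for an operator $A$ via
$$
B(u, v) \;=\; \langle u,\, A v \rangle \quad \text{for all vectors } u, v,
$$
and this correspondence is one-to-one, so the two descriptions carry the same information.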
Oops, good spot! I meant to write 1 minus that quantity. I’ve edited the OP.
This seems very interesting, but I think your post could do with a lot more detail. How were the correlations computed? How strongly do they support PRH? How was the OOD data generated? I’m sure the answers could be pieced together from the notebook, but most people won’t click through and read the code.
Ah, I think I understand. Let me write it out to double-check, and in case it helps others.
Say , for simplicity. Then . This sum has nonzero terms.
In your construction, . Focussing on a single neuron, labelled by , we have . This sum has nonzero terms.
So the preactivation of an MLP hidden neuron in the big network is . This sum has nonzero terms. We only “want” the terms where ; the rest (i.e. the majority) are noise. Each noise term in the sum is a random vector, so the different noise terms are roughly orthogonal, and so the norm of the noise is (times some other factors, but this captures the -dependence, which is what I was confused about).
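A quick numerical sanity-check of the “roughly orthogonal noise terms” step (dimensions and counts here are arbitrary, just for illustration): the norm of a sum of $k$ independent random unit vectors grows like $\sqrt{k}$ rather than $k$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # ambient dimension; arbitrary, just needs to be large

for k in [10, 100, 1000]:
    vs = rng.normal(size=(k, d))
    vs /= np.linalg.norm(vs, axis=1, keepdims=True)   # k random unit vectors, nearly orthogonal
    print(k, float(np.linalg.norm(vs.sum(axis=0))))   # grows like sqrt(k), not k
```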
I’m confused by the read-in bound:
Sure, each neuron reads from of the random subspaces. But in all but of those subspaces, the big network’s activations are smaller than , right? So I was expecting a tighter bound—something like:
It may be possible to massively reduce memory usage in sparsely-connected mode.
Let B be batch size, K be num active latents per dictionary per token, and F be num latents per dictionary.
My current implementation of sparsely-connected mode has a terrible O(F²) memory usage, since each virtual weight matrix has F² elements. But how many of these virtual weights do we actually need to compute?
Upstream latents: On each token in the batch, we only need the virtual weights connecting to the K active upstream latents.
Downstream latents: Strictly speaking, we should compute activations for every downstream latent, since we don’t know in advance which will be active. But, insofar as vanilla mode closely approximates sparsely-connected mode, we should be okay to only compute virtual weights connecting to downstream latents that were active in vanilla mode.
So on each token, we only need to compute K² virtual weights, and so the memory requirement is BK², which is small.
Of course, this new approach loses something: sparsely-connected mode now relies on vanilla mode to tell it which latents should activate. So much for a standalone replacement model! I think a reasonable middle-ground is to only compute virtual weights to the 100×K (say) latents with largest vanilla preactivation. Then compute sparsely-connected preactivations for all those latents, and apply TopK to get the activations. The memory usage is then 100BK², which is still small.
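A rough sketch of the bookkeeping I have in mind (all shapes and names are mine, and this is only the virtual-weight gather, not a full training step): per token, gather the K active upstream latents and the 100·K candidate downstream latents, and form only that B × (100K) × K slice of virtual weights.

```python
import torch

B, K, F, d = 8, 32, 24576, 768      # batch, active latents, dict size, model dim (illustrative)
C = 100 * K                         # candidate downstream latents per token

W_dec_up = torch.randn(F, d)        # upstream decoder directions
W_enc_down = torch.randn(d, F)      # downstream encoder directions

up_idx = torch.randint(F, (B, K))   # active upstream latents per token (from vanilla mode)
up_act = torch.rand(B, K)           # their activations
down_idx = torch.randint(F, (B, C)) # candidates: top-100K vanilla preactivations downstream

# Virtual weights only for (candidate downstream) x (active upstream) pairs:
#   V[b, c, k] = (downstream encoder dir of down_idx[b, c]) . (upstream decoder dir of up_idx[b, k])
dec = W_dec_up[up_idx]                        # (B, K, d)
enc = W_enc_down.T[down_idx]                  # (B, C, d)
V = torch.einsum("bcd,bkd->bck", enc, dec)    # (B, C, K): 100*B*K^2 entries, not F^2

# Sparsely-connected preactivations for the candidates, then TopK as usual:
pre = torch.einsum("bck,bk->bc", V, up_act)   # (B, C)
topk = pre.topk(K, dim=-1)
```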