The Jacobians are much more sparse in pre-trained LLMs than in re-initialized transformers.
This would be very cool if true, but I think further experiments are needed to support it.
Imagine a dumb scenario where during training, all that happens to the MLP is that it “gets smaller”, so that MLP_trained(x) = c * MLP_init(x) for some small c. Then all the elements of the Jacobian also get smaller by a factor of c, and your current analysis—checking the number of elements above a threshold—would conclude that the Jacobian had gotten sparser. This feels wrong: merely rescaling a function shouldn’t affect the sparsity of the computation it implements.
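To make the worry concrete, here's a minimal numpy sketch (a random matrix standing in for the real SAE-to-SAE Jacobian, and a made-up threshold) of how a fraction-above-threshold measure reacts to a uniform rescaling by c:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a feature-to-feature Jacobian (not from the paper).
J_init = rng.normal(size=(512, 512))
c = 0.1
J_trained = c * J_init  # the "MLP just gets smaller" scenario

def frac_above(J, tau):
    """Fraction of Jacobian elements whose magnitude exceeds a fixed threshold."""
    return np.mean(np.abs(J) > tau)

threshold = 1.0
print(frac_above(J_init, threshold))     # some nonzero fraction
print(frac_above(J_trained, threshold))  # ~0: looks "sparser", but only because of the rescaling
```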
To avoid this issue, you could report a scale-invariant quantity like the kurtosis of the Jacobian's elements (their fourth central moment divided by their variance squared), or the ratio of their L1 and L2 norms, or plenty of other options. But these quantities still aren't perfect, since they aren't invariant under linear transformations of the model's activations:
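For illustration, here's a sketch of those two scale-invariant options on the same kind of toy matrix; both are unchanged when the matrix is multiplied by a constant:

```python
import numpy as np

def kurtosis(J):
    """Fourth central moment of the elements divided by their variance squared (scale-invariant)."""
    x = J.ravel()
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2

def l1_over_l2(J):
    """Ratio of the L1 and L2 norms of the elements; also invariant under J -> c*J."""
    x = J.ravel()
    return np.abs(x).sum() / np.sqrt((x**2).sum())

rng = np.random.default_rng(0)
J = rng.normal(size=(512, 512))
for c in (1.0, 0.1, 0.01):
    print(kurtosis(c * J), l1_over_l2(c * J))  # identical for every c
```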
E.g. suppose an mlp_out feature F depends linearly on some mlp_in feature G, which is roughly orthogonal to F. If we stretch all model activations along the F direction and retrain our SAEs, then the new mlp_out SAE will (in an ideal world) contain a feature F’ which is the same as F but with activations larger by some factor. On the other hand, the mlp_in SAE will contain a feature G’ which is roughly the same as G. Hence the (F, G) element of the Jacobian has been made bigger simply by applying a linear transformation to the model’s activations. Generally this will affect our sparsity measure, which feels wrong: merely applying a linear map to all model activations shouldn’t change the sparsity of the computation being done on those activations. In other words, our sparsity measure shouldn’t depend on a choice of basis for the residual stream.
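A toy linear version of this (with a matrix W standing in for a linearized MLP, and f, g standing in for idealized SAE feature directions; all of these are made up for illustration) shows the (F, G) Jacobian element picking up exactly the stretch factor:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))   # stand-in for the (linearized) MLP: y = W x
f = np.eye(d)[0]              # mlp_out feature direction F
g = np.eye(d)[1]              # mlp_in feature direction G, orthogonal to F

jac_FG = f @ W @ g            # feature-space Jacobian element dF/dG

# Stretch all activations by s along the F direction: T = I + (s - 1) f f^T
s = 3.0
T = np.eye(d) + (s - 1.0) * np.outer(f, f)
T_inv = np.linalg.inv(T)

# In the transformed basis the model computes y' = (T W T^{-1}) x', and the
# (idealized) retrained SAEs still read off the same directions f and g.
W_prime = T @ W @ T_inv
jac_FG_prime = f @ W_prime @ g

print(jac_FG, jac_FG_prime)   # the second is s times the first
```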
I’ll try to think of a principled measure of the sparsity of the Jacobian. In the meantime, I think it would still be interesting to see a scale-invariant quantity reported, as suggested above.
I agree with a lot of this. We discuss the trickiness of measuring this properly in the paper (Appendix E.1), and I touched on it a bit in this post (the last bullet point in the last section). We did consider normalizing by the L2, but ultimately decided against it because the L2 indexes too heavily on the size of the majority of elements rather than on the size of the largest elements, so it’s not really what we want. Fwiw I think normalizing by the L4 or the L_inf is more promising.
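To illustrate what I mean (toy numbers, not from the paper): if you track the largest element after dividing by different norms, the L2-normalized value moves around a lot as the bulk of small elements grows, while the L4- and especially the L_inf-normalized values stay pinned to the largest elements:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_top(x, p):
    """Largest |element| after dividing by the Lp norm (p = np.inf allowed)."""
    norm = np.max(np.abs(x)) if p == np.inf else np.sum(np.abs(x) ** p) ** (1.0 / p)
    return np.max(np.abs(x)) / norm

big = np.array([5.0, -4.0, 3.0])                 # a few genuinely large Jacobian elements
for bulk_scale in (0.01, 0.1, 0.3):
    bulk = bulk_scale * rng.normal(size=10_000)  # lots of small "background" elements
    x = np.concatenate([big, bulk])
    print(bulk_scale,
          normalized_top(x, 2),       # moves a lot as the bulk grows
          normalized_top(x, 4),       # moves much less: dominated by the largest elements
          normalized_top(x, np.inf))  # always 1: depends only on the single largest element
```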
I agree it would be good for us to report more data on the pre-trained vs randomized comparison specifically. I don’t really see it as a central claim of the paper, so I didn’t prioritize adding material on it to the appendices, but I might do a revision with more stats on that, and I really appreciate the suggestions.