This is great work! We’ve been working on very similar things at Anthropic recently, also using gradient descent to train autoencoders for sparse coding of activations, but focusing more on making the sparse coding technique and loss more robust and on extending it to real datasets. Here are some thoughts I had while reading this:
I like the description of your more sophisticated synthetic data generation. We’ve only tried synthetic data without correlations and with uniform frequency. We’ve also tried real models for which we don’t have the ground truth but where we can easily visualize the feature directions (1-layer MNIST perceptrons).
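For concreteness, here is a rough sketch of what sparse synthetic data with non-uniform feature frequencies might look like. All names and distributions below are illustrative assumptions, not the post's exact generator; correlations between features (which the post handles and this sketch omits) could be added, e.g., by sampling the firing mask from shared latent causes.

```python
import numpy as np

def synthetic_activations(n_samples, n_true_features, d_model, seed=0):
    """Illustrative generator: each sample activates a sparse subset of
    ground-truth features, with power-law (non-uniform) firing rates."""
    rng = np.random.default_rng(seed)
    # Ground-truth feature directions in activation space, unit norm.
    feats = rng.standard_normal((n_true_features, d_model))
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    # Power-law firing probabilities: earlier features fire more often.
    probs = 0.5 / np.arange(1, n_true_features + 1)
    fired = rng.random((n_samples, n_true_features)) < probs
    # Positive coefficients on the features that fired.
    coeffs = fired * rng.uniform(0.0, 1.0, size=fired.shape)
    return coeffs @ feats, feats
```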
I like how the MMCS metric has an understandable 0-1 scale. We’ve been using a similar ground-truth loss but with a slightly different formulation that uses the norm of the vector difference rather than cosine similarity, which allows non-normalized features but doesn’t give a nice human-understandable scale.
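For readers unfamiliar with the metric, a minimal sketch of how a mean-max-cosine-similarity computation might look (the function name and matrix layout are assumptions; this is the standard formulation, not necessarily the post's exact code):

```python
import numpy as np

def mmcs(true_features, learned_features):
    """Mean Max Cosine Similarity: for each ground-truth feature, take
    the highest cosine similarity with any learned feature, then
    average. A perfect dictionary recovery scores 1.0; typical matches
    lie in [0, 1] (in general cosine similarity lies in [-1, 1])."""
    # Normalize rows to unit norm so dot products are cosine similarities.
    t = true_features / np.linalg.norm(true_features, axis=1, keepdims=True)
    l = learned_features / np.linalg.norm(learned_features, axis=1, keepdims=True)
    sims = t @ l.T                      # pairwise cosine similarities
    return sims.max(axis=1).mean()      # best match per true feature, averaged
```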
The different approaches for trying to find the correct dictionary size are great, and it’s good to see the results. The stickiness, the dead neurons, and the comparison against a larger dictionary’s result are all things we hadn’t looked at. We also see clear loss elbows for synthetic data but haven’t found any for real data yet. This does seem like one of the important unsolved problems.
That orthogonal initialization is one we haven’t seen before. Did you try multiple things and that one worked best? We’ve been using a kind-of-PCA-like algorithm on the activations for our initialization.
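My guess at what an orthogonal initialization for the dictionary could look like (this is a sketch under my own assumptions, not the post's code): QR decomposition of random Gaussian blocks yields orthonormal rows, and with an overcomplete dictionary orthogonality can only hold within each block of `d_model` rows.

```python
import numpy as np

def orthogonal_init(n_features, d_model, seed=0):
    """Initialize dictionary rows as orthonormal directions, block by
    block; n_features may exceed d_model (overcomplete dictionary)."""
    rng = np.random.default_rng(seed)
    rows = []
    while len(rows) < n_features:
        # QR of a square Gaussian matrix gives an orthogonal Q, so its
        # rows (as well as its columns) are orthonormal.
        q, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
        rows.extend(q)
    return np.array(rows[:n_features])
```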
Very interesting to hear that you’ve been working on similar things! Excited to see results when they’re ready.
RE synthetic data: I’m a bit less confident in this method of data generation after the feedback below (see Tom Lieberum’s and Ryan Greenblatt’s comments). It may lose some ‘naturalness’ compared with the way the encoder in the ‘toy models of superposition’ puts one-hot features in superposition. It’s unclear whether that matters for the aims of this particular set of experiments, though.
RE metrics: It’s interesting to hear about your alternative to the MMCS metric. Putting the scale in the feature coefficients rather than in the features themselves does make things intuitive!
RE orthogonal initialization:
IIRC this actually did help things learn faster (but I could be misremembering that, I didn’t make a note at that early stage). But if it does, I’m reasonably confident that it’ll be possible to find even better initialization schemes that work well for these autoencoders. The PCA-like algorithm sounds like a good idea (curious to hear the details!); I’d been thinking of a few similar-sounding things like:
1) Initializing the autoencoder features using noised copies of the left singular vectors of the weight matrix of the layer that we’re trying to interpret, since these define the major axes of variation in the pre-activations and so might resemble the (post-activation) features. Cf. Beren and Sid’s work ‘The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable’. Or
2) If we expect the privileged basis hypothesis to apply, then initializing the autoencoder features with noised unit vectors might speed up learning.
Or other variations on those themes.
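The two initialization ideas above could be sketched roughly like this. The function names, the noise scale, and the recycling of vectors when the dictionary is overcomplete are all my assumptions, not anything specified in the discussion:

```python
import numpy as np

def svd_init(layer_weight, n_features, noise_scale=0.1, seed=0):
    """Idea 1: noised copies of the left singular vectors of the weight
    matrix of the layer being interpreted."""
    rng = np.random.default_rng(seed)
    u, _, _ = np.linalg.svd(layer_weight, full_matrices=False)
    idx = np.arange(n_features) % u.shape[1]   # recycle if overcomplete
    w = u[:, idx].T + noise_scale * rng.standard_normal((n_features, u.shape[0]))
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def privileged_basis_init(n_features, d_model, noise_scale=0.1, seed=0):
    """Idea 2: noised standard-basis (one-hot) vectors, assuming the
    privileged basis hypothesis applies."""
    rng = np.random.default_rng(seed)
    w = np.eye(d_model)[np.arange(n_features) % d_model]
    w = w + noise_scale * rng.standard_normal(w.shape)
    return w / np.linalg.norm(w, axis=1, keepdims=True)
```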