Intro to Superposition & Sparse Autoencoders (Colab exercises)

Update (20th December) - these exercises have been edited to fix some previous issues, and new material has been added. In particular, the “superposition in a privileged basis” section has been rewritten, the neuron resampling methods in the SAEs section are now significantly improved, and there is also additional material on superposition & deep double descent.

This is a linkpost for some exercises on sparse autoencoders, which I’ve recently finished working on as part of the upcoming ARENA 3.0 iteration. Having spoken to Neel Nanda and others in interpretability-related MATS streams, it seemed useful to make these exercises accessible outside the context of the rest of the ARENA curriculum.

Links to Colabs (updated): Exercises, Solutions.

If you don’t like working in Colabs, then you can clone the repo, download the exercises & solutions Colabs as notebooks, and run them in the same directory.

Below is a brief summary of all 7 sets of exercises (you can scroll to the end if you’re mainly interested in sparse autoencoders!).

Guide to Exercises

The sets of exercises are roughly split into three larger sets. Exercises 1-3 are the “core TMS” exercises, which present some of the key ideas behind superposition. Exercises 4-5 are extensions of the TMS work, which are interesting but less essential. Exercises 6-7 contain the material on SAEs.

The exercises are labelled with their prerequisites. (n*) means exercise set n is essential for these exercises, and (n) means exercise set n is heavily recommended (but not essential).

Abbreviations: TMS = Toy Models of Superposition, SAE = Sparse Autoencoder.

  1. TMS: Superposition in a Nonprivileged Basis

  2. TMS: Correlated / Anticorrelated Features (1*)

  3. TMS: Superposition in a Privileged Basis (1*)

  4. Feature Geometry (1*)

  5. Superposition & Deep Double Descent (1*, 2)

  6. SAEs in Toy Models (1*, 2, 3)

  7. SAEs in Language Models (1*, 2, 3, 6*)

Below is a longer explanation of each of the seven exercise sets.

TMS: Superposition in a Nonprivileged Basis

This section introduces Anthropic’s toy model of superposition, in which a simple neural network is trained to map a set of features into a lower-dimensional space and then reconstruct them. You’ll learn how superposition works and see how it can be visualised, as well as how properties like feature sparsity affect the learned solutions.
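To make the setup concrete, here is a minimal sketch of the toy model's forward pass, assuming the form used in Anthropic's paper (reconstruction = ReLU(WᵀWx + b), with sparse inputs whose active values are uniform on [0, 1)). It uses plain Python lists so it runs without dependencies. The function names and the hand-picked pentagon weights are illustrative assumptions, not code from the Colabs.

```python
import math
import random


def sample_features(n_features, feature_prob, rng):
    """Sample one input: each feature is active with probability
    `feature_prob`; active features take a value uniform in [0, 1)."""
    return [rng.random() if rng.random() < feature_prob else 0.0
            for _ in range(n_features)]


def toy_model_forward(W, b, x):
    """W has n_hidden rows and n_features columns; b has n_features entries.
    Returns (hidden, reconstruction), where reconstruction = ReLU(W^T W x + b)."""
    hidden = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    recon = []
    for j in range(len(x)):
        pre = sum(W[i][j] * hidden[i] for i in range(len(W))) + b[j]
        recon.append(max(0.0, pre))
    return hidden, recon


def weighted_mse(x, recon, importance):
    """Reconstruction error, with each feature weighted by its importance."""
    return sum(w * (xi - ri) ** 2
               for w, xi, ri in zip(importance, x, recon)) / len(x)


# Hand-picked (NOT trained) weights: five feature directions arranged as a
# regular pentagon in a 2D hidden space, with a small negative bias.
n_features = 5
angles = [2 * math.pi * j / n_features for j in range(n_features)]
W = [[math.cos(a) for a in angles], [math.sin(a) for a in angles]]
b = [-0.35] * n_features

x = [1.0, 0.0, 0.0, 0.0, 0.0]  # only feature 0 is active
hidden, recon = toy_model_forward(W, b, x)
print("hidden:", hidden)
print("reconstruction:", recon)
print("loss:", weighted_mse(x, recon, [1.0] * n_features))

# A sparse random input, for comparison
print(sample_features(n_features, feature_prob=0.1, rng=random.Random(0)))
```

With these weights, the negative bias and the ReLU together zero out the interference that feature 0 causes in its neighbours. That is the basic mechanism that lets five features share two dimensions when inputs are sparse.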

TMS: Correlated / Anticorrelated Features

In this section, you’ll keep exploring the idea of superposition by seeing how the model’s learned solutions change when features are correlated or anticorrelated. Most features learned by real models are anticorrelated, simply because any given model input (e.g. an image or a passage of text) will only contain a limited number of features.
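As a rough illustration of what correlated and anticorrelated features can look like in the training data, here is a sketch that samples feature pairs. It assumes one simple definition: a correlated pair is always active or inactive together, and an anticorrelated pair never has both features active at once. The exercises may define these differently, and the function names here are hypothetical.

```python
import random


def _active_value(rng):
    """Value for an active feature, in (0, 1], so it is never exactly zero."""
    return 1.0 - rng.random()


def sample_correlated_pair(feature_prob, rng):
    """Both features are active together, or both are absent."""
    if rng.random() < feature_prob:
        return [_active_value(rng), _active_value(rng)]
    return [0.0, 0.0]


def sample_anticorrelated_pair(feature_prob, rng):
    """At most one of the two features is active (assumed definition)."""
    if rng.random() >= feature_prob:
        return [0.0, 0.0]
    value = _active_value(rng)
    # Pick which member of the pair is active with equal probability.
    return [value, 0.0] if rng.random() < 0.5 else [0.0, value]


rng = random.Random(0)
print([sample_correlated_pair(0.5, rng) for _ in range(3)])
print([sample_anticorrelated_pair(0.5, rng) for _ in range(3)])
```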

TMS: Superposition in a Privileged Basis

Next, the toy model setup is changed so that it has a privileged basis. If the previous sections were models of superposition in the residual stream, this section models superposition in the MLP layer. We’ll also explore how computation can be performed in superposition.
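One way to see what “privileged” means: an elementwise nonlinearity does not commute with rotations of the layer it acts on, so the individual neuron directions start to matter. The sketch below checks this numerically in 2D. It assumes the privileged basis comes from a ReLU applied to the hidden activations, and the helper names are hypothetical.

```python
import math


def relu(v):
    """Elementwise ReLU."""
    return [max(0.0, t) for t in v]


def rotate(v, angle):
    """Rotate a 2D vector anticlockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def relu_commutes_with_rotation(h, angle, tol=1e-9):
    """True if rotating then applying ReLU matches applying ReLU then rotating."""
    rotate_then_relu = relu(rotate(h, angle))
    relu_then_rotate = rotate(relu(h), angle)
    return all(abs(p - q) <= tol
               for p, q in zip(rotate_then_relu, relu_then_rotate))


h = [1.0, -1.0]
print(relu_commutes_with_rotation(h, 0.0))          # trivial rotation
print(relu_commutes_with_rotation(h, math.pi / 4))  # 45 degree rotation
```

Without the ReLU, the hidden layer is purely linear and any rotation can be absorbed into the weights, which is why the earlier sections have no privileged basis.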

Feature Geometry

Here, we take a deeper dive into the ways features can organize into different geometric structures when we increase the hidden dimension past the point where we can easily visualise it.

Superposition & Deep Double Descent

This section is based on a different Anthropic paper, which explores the idea that double descent happens as a result of models transitioning from a memorizing solution (representing datapoints in superposition) to a generalizing solution (representing features in superposition). Unlike the other 6 sections here, it’s very open-ended: it contains just one exercise, which is to replicate the paper, although this does come with a lot of guidance.

SAEs in Toy Models

We take the toy models from Anthropic’s Toy Models of Superposition paper (for which there are also exercises), and train sparse autoencoders on the representations learned by these toy models. These exercises culminate in using neuron resampling to successfully recover all the learned features from the toy model of bottleneck superposition:

Animation of the training process for SAEs in Anthropic’s toy model of superposition. The red neurons represent resamplings. All instances eventually converge to accurately representing all five of the original model’s features.
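For readers who want a picture of what is being trained, here is a minimal sketch of a sparse autoencoder's forward pass and loss. It assumes a common formulation: a ReLU encoder applied to the input with the decoder bias subtracted, a linear decoder, and a loss of squared reconstruction error plus an L1 penalty on the activations. It also includes a simple check for dead latents, which are the ones resampling would target. The names, shapes, and the dead-latent criterion are assumptions for illustration, and the resampling step itself is not shown.

```python
def relu(v):
    """Elementwise ReLU."""
    return [max(0.0, t) for t in v]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sae_forward(h, W_enc, b_enc, W_dec, b_dec):
    """W_enc: n_latents rows by n_hidden columns.
    W_dec: n_hidden rows by n_latents columns.
    Returns (latent activations, reconstruction of h)."""
    centred = [hi - bi for hi, bi in zip(h, b_dec)]
    acts = relu([p + b for p, b in zip(matvec(W_enc, centred), b_enc)])
    recon = [r + b for r, b in zip(matvec(W_dec, acts), b_dec)]
    return acts, recon


def sae_loss(h, acts, recon, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty."""
    l2 = sum((hi - ri) ** 2 for hi, ri in zip(h, recon))
    l1 = sum(abs(a) for a in acts)
    return l2 + l1_coeff * l1


def find_dead_latents(acts_history):
    """Indices of latents that never fired in a window of recorded activations."""
    n_latents = len(acts_history[0])
    return [k for k in range(n_latents)
            if all(acts[k] == 0.0 for acts in acts_history)]


# Tiny hand-built example: 2 hidden dimensions, 3 latents.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

h = [1.0, 0.0]
acts, recon = sae_forward(h, W_enc, b_enc, W_dec, b_dec)
print(acts, recon, sae_loss(h, acts, recon, l1_coeff=0.1))
print(find_dead_latents([acts]))
```

In the exercises, the input h is the toy model's hidden activation, and the hope is that each latent ends up corresponding to one of the toy model's original features.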

SAEs in Language Models

Finally, there are exercises on interpreting an SAE trained on a transformer, where you can discover some cool learned features. One example is a neuron exhibiting skip-trigram-like behaviour: it activates on left-brackets following Django-related syntax, and predicts the completion ('django.

You can either read through the Solutions colab (which has all output displayed & explained), or go through the Exercises colab and fill in the functions according to the specifications you are given, looking at the Solutions when you’re stuck. Both colabs come with test functions you can run to verify your solution works.


Please reach out to me if you have any questions or suggestions about these exercises (either by email at cal.s.mcdougall@gmail.com, or a LessWrong private message / comment on this post). Happy coding!