Useful starting code for interpretability

Want to try your hand at neural network interpretability? A very nice way to get started is to find an existing Python notebook using one or more interpretability techniques, hopefully one written with beginners in mind. In a click or two you can make a copy of it, which you can typically run without any modification, and then start tweaking it to look at what you’re interested in.

Fortunately, many such notebooks already exist, thanks to helpful members of the interp community! This post is just a list of those, mostly Colab notebooks. I have no personal experience with many of them, but all of them have been recommended by people who know what they’re doing. This list will probably stay acceptably current through late 2024 or so; after that, look for a more up-to-date resource (if one had existed when I wrote this, I would have used it instead of writing my own, so there may or may not be one by then).

Suggestions for similarly useful starter notebooks, including ones covering other areas, are very welcome!

Notebooks for understanding machine learning (as background): Transformers From Scratch, some other ML technique notebooks, reinforcement learning.

The main list is in no particular order, so there’s no need to work through it top to bottom.

@Neel Nanda’s exploratory analysis demo for TransformerLens walks you through many of the basic mech interp techniques and is highly recommended. He has other notebooks as well.
Another intro to mech interp from ARENA, along with several other excellent notebooks reproducing some important mech interp results:
Two activation steering notebooks, based on “Steering GPT-2-XL by adding an activation vector” (bonus: several different implementations from @Annah) (extra bonus: quick and dirty representation engineering on Mistral)
Developmental interpretability and singular learning theory notebooks, from @Jesse Hoogland.
A smallish notebook on using the tuned lens technique (successor to the logit lens).
Mech interp on Mamba using nnsight.
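
If you want a feel for what the activation steering notebooks above are doing before you open one, here is a minimal toy sketch of the core idea: take the difference between a model’s hidden activations on two contrasting inputs, then add a scaled copy of that difference to the hidden activations on a new input. Everything in it is an assumption made for illustration (the tiny hand-written two-layer "model", the weights, and names like `steering_vector` and `coeff`); it is not the code from any of the linked notebooks, which work on real language models.

```python
from typing import List, Optional

Vector = List[float]

# Hypothetical toy weights: a 2 -> 3 hidden layer and a 3 -> 1 output layer.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [[1.0, -1.0, 0.5]]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def relu(vec: Vector) -> Vector:
    return [max(0.0, v) for v in vec]


def add_scaled(a: Vector, b: Vector, scale: float = 1.0) -> Vector:
    """Return a + scale * b, elementwise."""
    return [x + scale * y for x, y in zip(a, b)]


def hidden(x: Vector) -> Vector:
    """Hidden activations of the toy model (stand-in for a real model's layer)."""
    return relu(matvec(W_IN, x))


def forward(x: Vector, steering: Optional[Vector] = None, coeff: float = 1.0) -> Vector:
    """Run the toy model, optionally adding a steering vector to the hidden layer."""
    h = hidden(x)
    if steering is not None:
        h = add_scaled(h, steering, coeff)
    return matvec(W_OUT, h)


# Steering vector: hidden activations on one input minus those on a contrasting input.
positive_input = [1.0, 0.0]
negative_input = [0.0, 1.0]
steering_vector = add_scaled(hidden(positive_input), hidden(negative_input), scale=-1.0)

test_input = [0.5, 0.5]
baseline = forward(test_input)
steered = forward(test_input, steering=steering_vector, coeff=2.0)

print("steering vector:", steering_vector)
print("baseline output:", baseline)
print("steered output: ", steered)
```

In a real notebook the "inputs" would be prompts, the hidden layer would be a chosen layer of the model, and the addition would typically be done with a forward hook, but the arithmetic is the same.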

Thanks to @Jesse Hoogland and @CallumMcDougall for extremely useful input!