Useful starting code for interpretability

Want to try your hand at neural network interpretability? A very nice way to get started is to find an existing Python notebook using one or more interpretability techniques, hopefully one written with beginners in mind. In a click or two you can make a copy of it, which you can typically run without any modification, and then start tweaking it to look at what you’re interested in.

Fortunately, many such notebooks already exist, thanks to helpful members of the interp community! This post is just a list of those, mostly Colab notebooks. Many of them I have no personal experience with, but all of them have been recommended by people who know what they’re doing. This list will probably be acceptably current through late 2024 or so; after that, you should use a more up-to-date resource if one exists (if one existed now, I would have used it instead of writing this list, so there is no guarantee one will exist then).

Suggestions for similarly useful starter notebooks covering other areas are very welcome!

The main list is in no particular order, so there is no need to go through it top to bottom.

Thanks to @Jesse Hoogland and @CallumMcDougall for extremely useful input!