A Comprehensive Mechanistic Interpretability Explainer & Glossary


This is a linkpost for a very long doc defining, explaining, and giving intuitions and conceptual frameworks for all the concepts I think you should know about when engaging with mechanistic interpretability. If you find the UI annoying, there’s an HTML version here.

Why does this doc exist?

  • The goal of this doc is to be a comprehensive glossary and explainer for Mechanistic Interpretability (focusing on transformer language models), the field that studies how to reverse engineer neural networks.

  • There’s a lot of complex terminology and jargon in the field! And it’s often scattered across various papers, which tend to be pretty well-written but aren’t designed as an introduction to the field as a whole. This doc aims to resolve some of that research debt and to be a canonical source for explaining concepts in the field.

  • I try to go beyond just being a reference that gives definitions, and to actually dig into how to think about a concept. Why does it matter? Why should you care about it? What are the subtle implications and traps to bear in mind? What is the underlying intuition, and how does it fit into the rest of the field?

  • I also go outside pure mechanistic interpretability, and try to define what I see as the key terms in deep learning and in transformers, and how I think about them. If you want to reverse engineer a system, it’s extremely useful to have a deep model of what’s going on inside of it. What are the key components and moving parts, how do they fit together, and how could the model use them to express different algorithms?

How to read this doc?

  • The first intended way is to use this as a reference. When reading papers, or otherwise exploring and learning about the field, come here to look up any unfamiliar terms and try to understand them.

  • The second intended way is to treat this as a map to the field. My hope is that if you’re new to the field, you can just read through this doc from the top, get introduced to the key ideas, and be able to dig into further sources when confused. And by the end, you should have a pretty good understanding of the key ideas, concepts and results!

  • It’s obviously not practical to fully explain all concepts from scratch! Where possible, I link to sources that give a deeper explanation of an idea, or where you can learn more.

    • More generally, if something’s not in this glossary, you can often find something good by googling it or searching on alignmentforum.org. If you can’t, let me know!

  • I frequently go on long tangents giving my favourite intuitions and context behind a concept—it is not at all necessary to understand these (though hopefully useful!), and I recommend moving on if you get confused and skimming these if you feel bored.

Table of Contents