To learn more about this work, check out the paper. We assume general familiarity with transformer circuits.
There isn’t much interpretability work that explains end-to-end how a model is able to do some task (except for toy models). In this work, we make progress towards this goal by understanding some of the structure of GPT-2 small “in the wild” by studying how it computes a simple natural language task.
The task we investigate is what we call indirect object identification (IOI), where sentences like “When John and Mary went to the store, John gave a drink to” should be completed with “Mary” as opposed to “John”.
We discovered the structure of a circuit of 26 attention heads grouped into 7 main classes; to our knowledge, this is the largest end-to-end attempt to reverse engineer how a language model computes a natural behavior. There is still much missing from our explanation, however, and it does not reach the parameter level.
Besides discovering the particular circuit shown above, we gained some interesting insights about low-level phenomena arising inside language models. For example, we found attention heads communicating with pointers (sharing the location of a piece of information instead of copying it). We also identified heads compensating for the loss of function of other heads, and heads contributing negatively to the correct next-token prediction. We’re excited to see if these discoveries generalize beyond our case study.
Since explanations of model behavior can be confused or non-rigorous, we used our knowledge to design adversarial examples. Moreover, we formulate 3 quantitative criteria to test the validity of our circuit. These criteria partially validate our circuit but indicate that there are still gaps in our understanding.
This post is a companion post to our paper where we share lessons that we learned from doing this work and describe some of Redwood’s interpretability perspectives. We share high-level takeaways and give specific examples from the work to illustrate them.
Goals of Interpretability for this Investigation
In this kind of mechanistic interpretability work, we tend to use the circuits abstraction. If we think of a model as a computational graph where nodes are terms in its forward pass (neurons, attention heads, etc) and edges are the interactions between those terms (residual connections, attention, etc), a circuit is a subgraph of the model responsible for some behavior.
Note that our work is slightly different from Chris Olah’s/Anthropic’s idea of a circuit in that we investigate this circuit on a specific distribution (instead of the entire distribution of text) and we also don’t attain an understanding of the circuit at the parameter-level.
Structure: Having all the nodes and edges
One valuable kind of interpretability insight is ensuring that we have the correct subgraph (all the nodes and edges), which is one of the main goals of this work.
We formulate three main quantitative criteria to measure progress towards this goal. These criteria rely on the idea of “knocking out” (turning off) a node of the computational graph by replacing its activation with its mean over a distribution where the IOI task is not present. (Note that we now believe that our causal scrubbing algorithm provides a more robust way to validate circuits.)
Faithfulness. Intuition: The circuit performs the task as well as the whole model.
Computation: A model with everything but the circuit knocked out should achieve the same performance as the whole model according to some task benchmark.
Completeness. Intuition: The circuit contains all the nodes that are important for the task. (Note that completeness implies faithfulness.)
Computation: For every subset of nodes in the circuit, knocking out that subset in the circuit causes a similar change in performance as knocking out the same subset in the whole model; the circuit and the model should behave similarly under knockouts.
Minimality. Intuition: The circuit does not contain any nodes that are irrelevant to the task.
Computation: For every node in the circuit, there exists a subset such that knocking out the subset and the node causes a large change in performance compared to knocking out the subset alone. Each node has a large counterfactual impact on performance in at least one scenario.
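To make the first two computations concrete, here is a toy sketch in Python. The `perf` function, the node set, and the per-node contributions are all invented for illustration; the paper’s actual benchmark is the logit difference on the IOI distribution.

```python
from itertools import combinations

def faithfulness_gap(perf, all_nodes, circuit):
    # Faithfulness: knock out everything outside the circuit and
    # compare against the intact model.
    return abs(perf(all_nodes - circuit) - perf(set()))

def completeness_gap(perf, all_nodes, circuit):
    # Completeness: worst-case disagreement between circuit and model
    # when the same subset of circuit nodes is knocked out in both.
    gaps = []
    for r in range(len(circuit) + 1):
        for subset in combinations(sorted(circuit), r):
            knocked = set(subset)
            in_circuit = perf((all_nodes - circuit) | knocked)
            in_model = perf(knocked)
            gaps.append(abs(in_circuit - in_model))
    return max(gaps)

# Toy "model": performance is the summed contribution of surviving nodes.
contrib = {"a": 5, "b": 4, "c": 1}
nodes = set(contrib)
perf = lambda knocked_out: sum(v for k, v in contrib.items() if k not in knocked_out)

circuit = {"a", "b"}
print(faithfulness_gap(perf, nodes, circuit))  # 1: the circuit misses c's small contribution
print(completeness_gap(perf, nodes, circuit))  # 1: c's absence shows up under every knockout
```

Minimality would be checked similarly, searching for a subset under which removing each individual circuit node changes performance a lot.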
One particular challenge in meeting these criteria is understanding distributed behaviors: behaviors where many components each contribute a little, computing the behavior in aggregate. We faced such a behavior when investigating the origin of the attention patterns of the S-Inhibition Heads, a crucial class of heads in our circuit that biases the model output against the Subject token and towards the correct Indirect Object token. Unfortunately, we suspect that most LM behaviors consist of massive numbers of correlations/heuristics implemented in a distributed way.
Another important, though less challenging, task is to understand redundant behaviors: several model components sometimes appear to behave identically when we study specific tasks. Our circuit is littered with redundancy: every main class contains multiple attention heads.
Semantic Understanding: Understanding the information flow
Even if you identify a complete and minimal set of components important for a model behavior, you still might not actually understand what each component in the circuit does.
Thus, a first important goal of interpretability is to understand what semantic information is being moved from node to node.
One class of nodes whose semantic role we understand pretty robustly is the Name Mover Heads. Here, we are pretty confident that these attention heads copy name tokens. Specifically, we find that the OV circuit of these heads copies names, and that how strongly Name Mover Heads write a certain name into the residual stream is correlated with how much attention they pay to it.
Gaining this kind of semantic understanding is particularly difficult because the model’s internal representations are a) different from our own and b) different from the input embedding space. Being unable to tie model representations to facts about the input makes any sort of understanding difficult. Causal interventions are nice because they localize aspects of the model and link them to specific facts about the input, furthering our understanding.
Our semantic knowledge of how the circuit performs IOI can be summarized in a simple algorithm. On the example sentence given in the introduction, “When John and Mary went to the store, John gave a drink to”:
Identify all previous names in the sentence (Mary, John, John).
Remove all names that are duplicated (in the example above: John).
Output the remaining name (Mary).
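As a sanity check, the three steps above can be written in a few lines of Python (a hypothetical `ioi_heuristic` helper, not the model’s actual computation):

```python
from collections import Counter

def ioi_heuristic(names):
    # 1. Identify all previous names; 2. remove duplicated names;
    # 3. output the remaining name (None if no unique name is left).
    counts = Counter(names)
    remaining = [name for name in names if counts[name] == 1]
    return remaining[0] if remaining else None

# "When John and Mary went to the store, John gave a drink to" -> "Mary"
print(ioi_heuristic(["John", "Mary", "John"]))  # Mary

# If every name is duplicated, the heuristic has no answer.
print(ioi_heuristic(["David", "Elizabeth", "David", "Elizabeth"]))  # None
```

The second call hints at why duplicating the IO token in a distractor sentence confuses the model: no name is left unique.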
Adversarial Examples: Using your understanding to generate predictions
Understanding all the nodes, edges, and information they move around is valuable, but, at the end of the day, we want to use this understanding to do useful things. In the future, we hope that powerful enough interpretability will enable discovery of surprising behavior or adversarial examples, potentially helping us fix or modify model behavior. Thus, seeing if our understanding of the IOI circuit in GPT-2 small helps us generate adversarial examples or any engineering-relevant insights could be useful (as discussed here).
Indeed, we were able to find adversarial examples based on our understanding of the IOI circuit. For example, prompts that repeat the indirect object token (the name of the person that receives something, IO for short) sometimes cause the model to output the subject token (the name of the person giving something, S for short) instead of the correct IO token, i.e. GPT-2 small generates completions like “Then, David and Elizabeth were working at the school. David had a good day. Elizabeth decided to give a bone to Elizabeth”. Since the IOI circuit identifies IO by recognizing that it is the only non-duplicated name in the sentence, we can trick the circuit by also duplicating the IO token in a distractor sentence. This adversarial attack made the model predict S over IO 23.4% of the time (compared to 0.7% without the distractor).
Ultimately, because of the simplicity of this task, these adversarial examples aren’t that mindblowing: we could have found them without the circuit but with more messing around. However, we’re pretty happy that our understanding of the circuit enables us to find them easily (one author spent 1h thinking about adversarial examples before finding these). Understanding these adversarial examples further would be valuable future work.
Lessons We Learned
Causal Interventions are extremely powerful
The easiest way to gain human understanding of model internals is to understand how they transform and move information about the input. For example, looking at attention patterns provided a lot of intuition for what the heads in our circuit were doing during our initial work.
Since we want to ground our understanding of model internals in the input, we can use carefully designed causal interventions to tie aspects of model internals to facts about the input. A powerful causal intervention technique is activation patching, where you replace the activation of a component with its activation on another input. In this work, we develop a more targeted type of activation patching, which we call path patching. Path patching lets us measure the direct effect of an attention head on the keys, queries, or values of another head, removing the effect of intermediate attention heads. We use both extensively in our work.
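For intuition, here is activation patching on a toy two-layer “model” made of plain Python functions. Everything here is invented for illustration; the real interventions operate on GPT-2’s cached activations.

```python
# Toy "model": two layers composed together.
def layer1(x):
    return 2 * x

def layer2(h):
    return h + 1

def run(x, patched_h=None):
    # Optionally splice a cached activation in place of layer1's output.
    h = layer1(x) if patched_h is None else patched_h
    return layer2(h)

clean, corrupted = 3, 5
cached_h = layer1(corrupted)        # cache the activation from the corrupted run
patched = run(clean, patched_h=cached_h)
effect = patched - run(clean)       # causal effect of layer1's activation
print(effect)  # 4
```

Path patching refines this idea by patching only along one chosen path (e.g. into a single head’s queries) while recomputing everything else on the clean input.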
For example, we found initial evidence that the S-Inhibition Heads move positional information around (described in Appendix A of the paper). If you patch head outputs from prompts of the form ABBA (“Then, John and Mary… Mary…”) to BABA (“Then, Mary and John… Mary…”) and vice versa, you cause a large drop in performance. Since the only information that changes is where the names are located, this result implies that the S-Inhibition Heads give positional clues telling the Name Movers where to attend.
Causal interventions can precisely investigate the importance of a particular path of information flow for a behavior in a model. Additionally, they can always be quantified with the same metric, enabling easy comparison (in this work, we always measure the difference between the S and IO logit).
Causal interventions like activation patching are particularly useful when interpreting algorithmic tasks with schema-like inputs. Algorithmic tasks have easy-to-generate distributions of inputs with well-defined model behavior, which enable clearer interpretations of causal interventions.
For example, in the IOI distribution, prompts look like “Then, [A] and [B] went to the [PLACE]. [B] gave an [OBJECT] to”, where there is a clear indirect object. We can also easily create another distribution of prompts, which we call the ABC distribution, that looks like “Then, [A] and [B] went to the [PLACE]. [C] gave an [OBJECT] to”, where there is no single indirect object. Intuitively, replacing activations from the IOI distribution at key nodes with activations from the ABC distribution should decrease the model’s probability of outputting a clear indirect object.
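A sketch of how such paired distributions can be generated. The name, place, and object lists here are invented fillers, not the paper’s actual template vocabulary.

```python
import random

NAMES = ["John", "Mary", "David", "Elizabeth", "Anna"]
PLACES = ["store", "school", "garden"]
OBJECTS = ["apple", "orange", "umbrella"]

def ioi_prompt(rng):
    # [B] is duplicated, so there is a unique indirect object [A].
    a, b = rng.sample(NAMES, 2)
    return (f"Then, {a} and {b} went to the {rng.choice(PLACES)}. "
            f"{b} gave an {rng.choice(OBJECTS)} to")

def abc_prompt(rng):
    # Three distinct names: nothing is duplicated, no clear indirect object.
    a, b, c = rng.sample(NAMES, 3)
    return (f"Then, {a} and {b} went to the {rng.choice(PLACES)}. "
            f"{c} gave an {rng.choice(OBJECTS)} to")

rng = random.Random(0)
print(ioi_prompt(rng))
print(abc_prompt(rng))
```

Patching then pairs each IOI prompt with an ABC prompt from the same template, so only the name structure differs between the two runs.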
Knockouts and Patching are Tricky: Do You Have the Right Distribution?
However, patching from the right distribution is pretty tricky. When you replace the output of a node with certain activations (either in patching or knockouts), it’s important to think about what information is present in those replacement activations, why it’s useful for that information to be present, and what effect that information will have on the rest of the model’s behavior.
As an example, let’s focus on mean ablations (replacing the output of an element by its mean activation). In earlier versions of this work, when we mean-ablated everything but the circuit to measure faithfulness, completeness, and minimality, we would replace unimportant nodes with their mean on a specific template in the IOI distribution. We wanted to mean-ablate to remove the effect of a node while still not destroying model internals (See Knockout section in the paper for more information).
This decision ended up being particularly bad because the mean activation over the IOI distribution still contained information that helped compute the task. For example, the circuit we propose involves induction heads and duplicate token heads which function to detect the duplicated name. Despite this, we initially found that mean-ablating the induction and duplicate token heads had little effect. Indeed, even if their mean activation did not contain which name was duplicated, it still contained the fact that a name was duplicated. This means that the functionality of those heads was not hampered by this early version of knockouts.
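A toy numerical illustration of this leak, using invented two-component activations (first component: “some name was duplicated”, second component: “which name”):

```python
import statistics

# Each tuple is a hypothetical head output on one IOI prompt.
# The first component ("some name was duplicated") is 1 on every prompt;
# the second ("which name") varies between +1 and -1 across prompts.
acts = [(1, 1), (1, -1), (1, 1), (1, -1)]
mean_act = tuple(statistics.mean(component) for component in zip(*acts))

# The mean erases *which* name was duplicated (second component -> 0) but
# preserves the fact *that* a name was duplicated (first component stays 1),
# so mean-ablating over the IOI distribution still feeds this bit to the model.
print(mean_act)
```

Taking the mean over a distribution where the task is absent (as described above) avoids this, because the “duplicated” flag is no longer constant at 1.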
How We Chose This Problem: Picking the Task is Important
We like to call this kind of interpretability work “streetlight interpretability.” In a sense, we’re under a streetlight, interpreting the behaviors which we’re aware of and can see clearly under the light, while being unaware of all the other behaviors that lurk in the night. We try to interpret tasks that we think we’ll succeed at interpreting, creating selection pressure for easy-to-interpret behaviors. The IOI task, as a behavior under the streetlight, is probably unrepresentative of how easy it is to do interpretability in general and is not representative of the set of model behaviors you might have wanted to understand.
Picking the task to interpret is a large part of this work. We picked the IOI task, in large part, because it’s a crisp, algorithmic task (and thus easier to interpret). We discuss why we chose this specific problem in the following subsections:
Why focus on a big(ger) model over a small model?
We wanted to focus on a bigger model to learn more about the difficulties of large models. We were particularly excited to see if we could find evidence that mechanistic interpretability of large language models wasn’t doomed.
Why focus on an actual language model (that contains other behaviors/distractions) vs a toy model (with no distractions)?
Both approaches have their advantages. Working on toy models is nice because it’s a lot easier, which enables more complete understanding. On the other hand, working on an actual language model is a lot harder, but you can be a bit more confident that lessons learned will generalize to bigger, more capable models.
Why focus on crisp, algorithmic tasks vs soft heuristics/bigrams?
Algorithmic tasks (like IOI) are easier to interpret than bigram-y, heuristic-y tasks. One way to think about this is that algorithmic tasks need to compute discrete steps, which are more likely to create circuit structures with discrete components.
Another way to think about this is that algorithmic tasks are more likely to be coherent, i.e. the model behavior is largely produced by a single circuit instead of an ensemble of circuits. One reason picking a behavior (and a representative distribution to study it) is difficult is that you don’t know whether the behavior is produced by the same circuit across the entire distribution.
In our work, it’s probably true that the circuits used for each template are actually subtly different in ways we don’t understand. As evidence for this, the standard deviation of the logit difference is ~40%, and we don’t have good hypotheses to explain this variation. It is likely that the circuit we found was just the one most active across this distribution.
If you have feedback for this work, we’d love to hear it! Reach out to firstname.lastname@example.org, email@example.com, firstname.lastname@example.org if you have thoughts or comments you want to share.
How you can do this work
We’re far from understanding everything about IOI, and there are many more exciting interpretability tasks to do. For IOI specifically, there are many not-well-understood mechanisms, and investigating them would definitely lead to some cool stuff. For example, our understanding of the MLPs, the attention patterns of the S-Inhibition Heads, Layer Norm, the IOI circuit in different and larger models, the circuit’s behavior on adversarial examples, and legibility-related questions is lacking.
Doing this kind of interpretability is easy with EasyTransformer, a transformer interpretability library made by Neel Nanda, which we’ve added some functionality to. Check out our GitHub repo and this Colab Notebook to reproduce some of our key results.
We are looking for more algorithmic tasks that GPT-2 small can perform, to do interpretability on! Check out this post on our criteria for desirable behaviors from GPT-2 and the web tool for playing with the model.
If you want to do this kind of research, Redwood is running an interpretability sprint, which we’re calling REMIX (Redwood Research Mechanistic Interpretability Experiment) in January. Learn more about REMIX here.
Since our sentences were generated by randomly sampling places and objects to fill templates, the results can sometimes be quite silly.
Really excited to see this come out! This feels like one of my favourite interpretability papers in a while. The part of this paper I found most surprising/compelling was just seeing the repeated trend of “you do a natural thing, form a plausible hypothesis with some evidence. Then dig further, discover your hypothesis was flawed in a subtle way, but then patch things to be better”. Eg, the backup name movers are fucking wild.
I recently applied causal scrubbing to test the hypothesis outlined in the paper (as part of my work at Redwood Research). The hypothesis was defined from the circuit presented in Figure 2. I used a simple setting similar to the experiments on Induction Heads. I used two types of inputs:
x_ref, the correct input for the circuit.
x_scrub, an input with the same template but a randomized subject and indirect object, used as the input for paths not included in the circuit.
I allowed all MLPs on every path of the circuit. The only attention heads not scrubbed in this hypothesis are the ones from the circuit, split by keys, queries, values, and position as in the circuit diagram.
This experiment directly addresses our claim in the paper, as we did not study MLPs (i.e. they always act as black boxes in our experiments).
The logit difference of the scrubbed model is 1.854±2.127 (mean±std), 50%±57% of the original logit difference.
I connected Name Movers’ keys and values to MLP0 at the IO and S1 positions. All the paths from embeddings to these MLP0 are allowed.
I allowed all the paths involving all attention heads and MLPs from the embeddings to
The queries of the S-Inhibition Heads at the END position.
The queries of the Duplicate Token Heads and Induction Heads at the S2 position.
The keys and values of the Duplicate Token Heads at the S1 and IO positions.
The values of the Previous Token Heads at the S1 and IO positions.
Inside the circuit, only the direct interactions between heads are preserved.
The logit difference of the scrubbed model is 0.831±2.127 (22%±64% of the original logit difference).
How should we interpret these numbers? I don’t really know. My best guess is that the circuit we presented in the paper is one of the best small sets of paths to look at to describe how GPT-2 small achieves high logit difference on IOI. However, in absolute terms, many more paths that we don’t have a good way to describe succinctly also matter.
The measures are extremely noisy. This is consistent with the measure of logit difference on the original model, where the standard deviation was around 30% of the mean logit difference. However, I ran causal scrubbing on a dataset with enough samples (N=100) that the numbers are interpretable. I don’t understand the source of this noise, but I suspect it is caused by various names triggering various internal structures inside GPT-2 small.
How does it compare to the validations from the paper? The closest validation score from the paper is the faithfulness score: the logit difference of the model where all the nodes not in the circuit are mean-ablated. In the paper, we present a score of 87% of the logit difference. I think the discrepancy with the causal scrubbing results from experiment 1 comes from the facts that i) resampling ablation is more destructive than mean ablation in this case, and ii) causal scrubbing is stricter, as it selects paths and not nodes (e.g. in the faithfulness tests, Name Mover Heads at the END position see the correct output of Induction Heads at the S2 position; this is not the case in causal scrubbing). The results of the causal scrubbing experiments update me toward thinking that our explanation is not as good as I thought from our faithfulness score, but not so bad as to strongly question the claims from the paper.
An interesting aspect of the heuristic the model found is that it’s wrong. That’s why it’s possible to construct adversarial examples that trick the heuristic.
I think if I’m going to accuse the model’s heuristic of being “wrong” then it’s only fair that I provide an alternative. Here’s an attempt at explaining why “Mary” is the right answer to “When Mary and John went to the store, John gave a drink to”:
John probably gives the drink to one of the people in the context (John or Mary).
If John were the receiver, we’d usually say “John gave himself a drink”. (Probably not “John gave a drink to himself”, and never “John gave a drink to John”.)
The only person left is Mary, so Mary is probably the receiver.
Instead the model “cheats” with a heuristic that might work quite often on the training set but doesn’t properly understand what’s going on, which makes it generalize poorly to adversarial examples.
I wonder whether this wrongness just reflects the smallness of GPT-2 small, or whether it’s found in larger models too. Do larger models get better performance because they find correct heuristics instead, or because they develop a more diverse set of wrong heuristics?
Very interesting work! I have a couple questions.
Looking at your example, “Then, David and Elizabeth were working at the school. Elizabeth had a good day. Elizabeth decided to give a bone to Elizabeth”. I’m confused. You say “duplicating the IO token in a distractor sentence”, but I thought David would be the IO here?
I also tried this sentence in your tool and only got a 2.6% probability for the Elizabeth completion.
However, repeating the David token raises that to 8.1%.
Am I confused about the meaning of the IO or was there just a typo in the example?
I’d love it if you could expand on this (maybe with an example). It sounds like you’re implying that the circuit you found is not complete?
Thanks for your comment!
You are right, there is a typo here. The correct sentence is “Then, David and Elizabeth were working at the school. David had a good day. Elizabeth decided to give a bone to Elizabeth”
When using the corrected adversarial prompt, the probability of S (“Elizabeth”) increases while the probability of IO (“David”) decreases.
Thanks a lot for spotting the typo, we corrected the post!
One way we think the circuit can differ depending on examples is if different semantic meanings are involved. For instance, in the example above, the object given is a “bone”, such that “a dog” could also be a plausible prediction. If “Elizabeth decided to give a kiss”, then the name of a human seems more plausible. If this is the case, then there should be additional components interacting with the circuit we described to incorporate information about the meaning of the object.
In addition to semantic meaning, there could be a different circuit for each template: different circuits could handle different sentence structures.
In our study, we did not investigate what differs between specific examples, as we always average experimental results over the full distribution. So in this sense the circuit we found is not complete, as we cannot explain the full distribution of the model’s outputs. However, we would expect each such circuit to be a variation of the circuit we described in the paper.
There are other ways we think our circuit is not complete; see Section 4.1 for more experiments on these issues.
Exciting work! A minor question about the paper:
Does this mean that it writes a projection of S1’s positional embedding to S2’s residual stream? Or is it meant to say “writing to the position [residual stream] of [S2]”? Or something else?
Thanks for the feedback!
Our current hypothesis is that they write some information about S1’s position (which we called the “position signal”; not as straightforward as a projection of its positional embedding) into the residual stream of S2. (See the paragraph “Locating the position signal.” in Section 3.3.) I hope this answers your questions.
We currently think that the position signal is a relative pointer from S2 to S1, computed by the difference between the positions S2 and S1. However, our evidence for this claim is quite small (see the last paragraph of Appendix A).
That’s definitely an exciting direction for future research!