A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Link post

New paper walkthrough: Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small is a really exciting new mechanistic interpretability paper from Redwood Research. They reverse engineer a 26(!)-head circuit in GPT-2 Small, used to solve Indirect Object Identification: the task of understanding that the sentence “After John and Mary went to the shops, John gave a bottle of milk to” should end with Mary, not John.
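
To make this concrete, the underlying check is just comparing the model's logits for “ Mary” vs “ John” at the final position of the prompt. Here's a quick sketch of how that might look with my EasyTransformer library (this is my own illustrative snippet, not code from the paper, and it assumes the model's forward pass accepts a raw string and returns per-position logits):

```python
import torch
from easy_transformer import EasyTransformer

# Load GPT-2 Small, the model studied in the paper
model = EasyTransformer.from_pretrained("gpt2")

prompt = "After John and Mary went to the shops, John gave a bottle of milk to"
with torch.no_grad():
    logits = model(prompt)  # assumed shape: [batch, position, d_vocab]

# Token ids for the two candidate completions (note the leading space)
mary = model.tokenizer.encode(" Mary")[0]
john = model.tokenizer.encode(" John")[0]

final_logits = logits[0, -1]
print("Logit difference (Mary - John):", (final_logits[mary] - final_logits[john]).item())
```

A clearly positive logit difference means the model prefers the indirect object (Mary) over the repeated subject (John), which is exactly the behaviour the circuit implements.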

I think this is a really cool paper that illustrates how to rigorously reverse engineer real models, and is maybe the third example of a well-understood circuit in a real model. So I wanted to understand it better by making a walkthrough (and hopefully help other people understand it too!). In this walkthrough I’m joined by three of the authors: Kevin Wang, Arthur Conmy and Alexandre Variengien. In Part 1 we give an overview of the high-level themes in the paper and what we think is most interesting about it. If you’re willing to watch an hour of this and want more, in Part 2 we do a deep dive into the technical details, reading through the paper together and digging into how the circuit works and the techniques used to discover it.

If you’re interested in contributing to this kind of work, apply to Redwood’s REMIX program! The deadline is Nov 13th.

We had some technical issues in filming this, and my video doesn’t appear—sorry about that! I also tried my hand at some amateur video editing to trim it down—let me know if this was worthwhile, or if it’s bad enough that I shouldn’t have bothered lol. If you find this useful, you can check out my first walkthrough on A Mathematical Framework for Transformer Circuits. And let me know if there are more interpretability papers you want walkthroughs on!

If you want to try exploring this kind of work for yourself, there’s a lot of low-hanging fruit left to pluck! Check out my EasyTransformer library and the codebase accompanying this paper. A good starting point is their notebook giving an overview of the key experiments, or my notebook modelling what initial exploration of a task like this could look like (there’s also a short code sketch after the list below). Some brainstormed future directions:

  • 3-letter acronyms (or more!)

  • Converting names to emails.

    • An extension task: constructing an email from a snippet like the following: Name: Neel Nanda; Email: last name dot first name k @ gmail

  • Grammatical rules

    • Learning that words after full stops are capitalised

    • Verb conjugation

    • Choosing the right pronouns (e.g. he vs she vs it vs they)

    • Whether something is a proper noun or not

  • Detecting sentiment (e.g. predicting whether something will be described as good vs bad)

  • Interpreting memorisation. E.g., there are times when GPT-2 knows surprising facts like people’s contact information. How does that happen?

  • Counting objects described in text. E.g.: I picked up an apple, a pear, and an orange. I was holding three fruits.

  • Extensions from Alex Variengien

    • Understanding what’s happening in the adversarial examples: most notably the S-Inhibition Heads’ attention patterns (hard). (S-Inhibition Heads are described in the IOI paper.)

    • Understanding how the positional signal is encoded (relative distance, or something else?). Bonus points if we have a story that includes the positional embeddings and explains how the difference between positions is computed (if relative distance is the right framework) by Duplicate Token Heads / Induction Heads. (Hard, but less context-dependent.)

    • What is the role of MLPs in IOI? (Quite broad and hard.)

    • What is the role of Duplicate Token Heads outside IOI? Are they used in other Q-compositions with S-Inhibition Heads? Can we describe how their QK circuit implements “collision detection” at the parameter level? (The last question is not very context-dependent and quite tractable.)

    • What is the role of Negative / Backup / regular Name Mover Heads outside IOI? Can we find examples where Negative Name Movers contribute positively to the next-token prediction?

    • What are the differences between the 5 induction heads present in GPT-2 Small? Which heads do they rely on, and which later heads do they compose with? (Low context dependence from IOI.)

    • Understanding head 4.11 (a really sharp previous token head) at the parameter level. I think this could be quite tractable, given that its attention pattern is almost perfectly off-diagonal.

    • What are the conditions for compensation mechanisms to occur? Are they due to dropout? (Arthur Conmy is working on this; feel free to reach out to arthur@rdwrs.com.)

  • Extensions from Arthur Conmy

    • Understand IOI in GPT-Neo: it’s a model of the same size, but it does IOI via composition of MLPs.

    • Understand IOI in the Stanford Mistral models: they all seem to do IOI in the same way, so maybe look at the development of the circuit through training?
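
To give a flavour of the initial exploration I mentioned above, here’s a quick sketch for the acronym task, again using EasyTransformer. The prompt is just an illustrative one I made up, and the same API assumptions apply (forward pass takes a raw string and returns per-position logits):

```python
import torch
from easy_transformer import EasyTransformer

model = EasyTransformer.from_pretrained("gpt2")

# Illustrative acronym prompt: does the model go on to predict "IMF"?
prompt = "The International Monetary Fund ("
with torch.no_grad():
    logits = model(prompt)  # assumed shape: [batch, position, d_vocab]

# Look at the top 5 next-token predictions at the final position
top = torch.topk(logits[0, -1], k=5).indices
print([model.tokenizer.decode(int(t)) for t in top])
```

If the model can do the task at all, the next steps look much like the IOI paper’s: build a dataset of prompts, pick a metric (e.g. the logit of the correct letter vs plausible alternatives), and start patching activations to localise which heads matter.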

Some of the links mentioned in the video: