Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small

Some of Redwood’s current research involves finding specific behaviors that language models exhibit, and then doing interpretability to explain how the model performs these behaviors. One example of this is the indirect object identification (IOI) behavior, investigated in a forthcoming paper of ours: given the input “When John and Mary went to the store, Mary gave a flower to”, the model completes “John” instead of “Mary”. Another example is the acronym generation task: given the input “In a statement released by the Big Government Agency (”, the model completes “BGA)”.

We are considering scaling up this line of research a bunch, and that means we need a lot more behaviors to investigate! The ideal tasks that we are looking for have the following properties:

  1. The task arises in a subset of the training distribution. Both the IOI and the acronym tasks are abundant in the training corpus of language models. This means that we are less interested in tasks specific to inputs that never appear in the training distribution.

    1. The ideal task can even be expressed as a regular expression that can be run on the training corpus to obtain the exact subset, as is the case for acronyms. The IOI task is less ideal in this sense, since it is harder to identify the exact subset of the training distribution that involves IOI.

  2. There is a simple heuristic for the task. For IOI, the heuristic is “fill in the name that appeared only once so far”. For acronyms, the heuristic is “string together the first letter of each capitalized word, and then close the parentheses”. Note that the heuristic does not have to apply to every instance of the task in the training distribution, e.g. sometimes an acronym is not formed by simply taking the first letter of each word.

    1. The gold standard here is if the heuristic can be implemented in an automated way, e.g. as a Python function, but we would also consider the task if a human is needed to supply the labels.

  3. GPT-2 small implements this heuristic. We are focusing on the smallest model in the GPT-2 family right now, which is a 117M parameter model that is frankly not that good at most things. We are moderately interested in tasks that bigger models can do that GPT-2 small can’t, but the bar is at GPT-2 small right now.
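
To make the first two properties concrete, here is a minimal sketch of what "a regular expression plus a Python heuristic" could look like for acronyms, together with the IOI heuristic. The pattern, the function names, and the caller-supplied `names` list are all hypothetical illustrations, not the code we actually use, and the pattern is deliberately crude (it only handles runs of simple capitalized words).

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: a run of two or more capitalized words followed by " (".
# It is deliberately crude and will miss many real acronyms.
CAPITALIZED_PHRASE = r"((?:[A-Z][a-z]+ )+[A-Z][a-z]+) \("


def acronym_heuristic(prompt: str) -> Optional[str]:
    """Expected completion for a prompt that ends in 'Capitalized Words ('."""
    match = re.search(CAPITALIZED_PHRASE + r"$", prompt)
    if match is None:
        return None
    # First letter of each capitalized word, then close the parenthesis.
    return "".join(word[0] for word in match.group(1).split()) + ")"


def find_acronym_instances(corpus_text: str) -> List[Tuple[str, str]]:
    """Return (phrase, expected completion) pairs found anywhere in a text."""
    return [
        (m.group(1), "".join(w[0] for w in m.group(1).split()) + ")")
        for m in re.finditer(CAPITALIZED_PHRASE, corpus_text)
    ]


def ioi_heuristic(prompt: str, names: List[str]) -> Optional[str]:
    """Return the candidate name that appears exactly once in the prompt."""
    once = [
        name
        for name in names
        if len(re.findall(r"\b" + re.escape(name) + r"\b", prompt)) == 1
    ]
    # Only answer when the heuristic is unambiguous.
    return once[0] if len(once) == 1 else None
```

With these definitions, `acronym_heuristic("In a statement released by the Big Government Agency (")` returns `"BGA)"`, and `ioi_heuristic` returns `"John"` for the IOI prompt above when given `["John", "Mary"]` as candidate names. Needing that candidate list is one reason IOI is harder to pin down with a regular expression than acronyms are.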


The following is a list of tasks that we have found so far or are aware of. Induction and acronym generation remain the tasks that best meet all of the above desiderata.

  • Induction: If the model has seen … A B … in the input, then it is more likely to output B immediately after A when encountering A again. A common use case of induction is remembering complicated names: T|cha|ik|ovsky … T|cha|ik|ovsky

  • Acronym generation

  • IOI

  • Email address generation: Hi! My name is John Smith, and I work at Apple. My email address

  • Subject-verb agreement: given “The keys to the cabinet”, the model completes “are” and not “is”, even though the singular “cabinet” is also present in the prompt. (Credit to Nate Thomas)

  • Incrementing citation numbers: Most metals are excellent conductors [1]. However, this is not always the case [2] (Credit to Lucas Sato)

  • Variations of induction and IOI

    • Extended IOI: identify the right person out of a group with more than two names

    • Inferring someone’s last name from name of relative

    • <img alt="A man in a hat" src="man_in_hat.jpg"><img alt="A dog in my house" src=" → dog_in_house.jpg"> (Credit to Rob Miles)
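
As an illustration of how a couple of the heuristics in this list might be automated, here is a rough sketch for induction and for incrementing citation numbers. The function names are hypothetical, the induction function works on an already-tokenized input (the tokenization itself is assumed to come from elsewhere), and "last citation number plus one" is our guess at a reasonable rule rather than a claim about what the model does internally.

```python
import re
from typing import Optional, Sequence


def induction_heuristic(tokens: Sequence[str]) -> Optional[str]:
    """Predict the token that followed the most recent earlier copy of the last token."""
    if not tokens:
        return None
    last = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None


def citation_heuristic(prompt: str) -> Optional[str]:
    """For a prompt ending in '[', expect the last citation number plus one."""
    numbers = re.findall(r"\[(\d+)\]", prompt)
    if not numbers or not prompt.endswith("["):
        return None
    return str(int(numbers[-1]) + 1) + "]"
```

For example, on the token sequence `T|cha|ik|ovsky| was| a| composer|.|T|cha` the induction heuristic predicts `ik`, and on a prompt ending in `... conductors [1]. However, this is not always the case [` the citation heuristic expects `2]`.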

Some examples that we are less excited about include:

  • Closing docstrings, parentheses, etc.

  • Close variations of known examples

We would love for interested people to contribute ideas! Below are some resources we put together to make the search as easy as possible:

  • You can play with GPT-2 small here in a web app that we built specifically for investigating model behaviors. Alternatively, GPT-2 is also available on the HuggingFace website, or you could play with Fairseq-125M, which is approximately as good as GPT-2 small. If you find that you need more sophisticated tools to interact with the model, I may be able to supply them. Neel Nanda also has a bounty out on Bountied Rationality for essentially the same thing; feel free to check out the resources he provides and/or participate in the bounty.

  • It might be useful to keep in mind what kinds of texts GPT-2 small was trained on. This is the description of the training corpus according to HuggingFace: “The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weights 40GB of texts but has not been publicly released. You can find a list of the top 1,000 domains present in WebText here.” Here is a website that shows random snippets of OpenWebText (the open clone of the unreleased WebText) whenever you refresh it.

  • Some folks found that playing the language modeling game is helpful for coming up with these tasks.


  • What counts as the model implementing a heuristic successfully? In general, we just want the model’s output to correlate with the answer expected from the heuristic. This means that we don’t demand that the model always output the correct answer, or that it put the highest probability on the expected answer. We just care that the model puts high enough probability on the expected answer. With the tools we have, it might still be difficult to evaluate exactly how good the model is at a given task. When in doubt, feel free to post the task; just qualify it with your level of confidence in the model’s ability to do it.

  • We encourage people to post the examples they find directly in the comments of this thread! Your examples might serve as inspiration for others. For convenient sharing, there is a “Share to URL” button on the model behavior search website that copies to the clipboard a URL that restores the current state of the web app.
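
If you have access to the model’s next-token logits, one rough way to quantify “high enough probability on the expected answer” is sketched below. This is only an illustration under assumptions: the function names are hypothetical, the logits are assumed to be a plain list of floats over the vocabulary, and we do not prescribe any particular threshold.

```python
import math
from typing import Sequence, Tuple


def expected_token_stats(logits: Sequence[float], expected_id: int) -> Tuple[float, int]:
    """Return (probability, rank) of the expected token under a softmax of the logits."""
    top = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - top) for x in logits]
    prob = exps[expected_id] / sum(exps)
    # Rank 1 means no other token has a strictly higher logit.
    rank = 1 + sum(1 for x in logits if x > logits[expected_id])
    return prob, rank


def mean_expected_probability(
    examples: Sequence[Tuple[Sequence[float], int]]
) -> float:
    """Average probability of the expected token over (logits, expected_id) pairs."""
    probs = [expected_token_stats(logits, idx)[0] for logits, idx in examples]
    return sum(probs) / len(probs)
```

Reporting both the probability and the rank is a design choice: the expected answer does not need to be rank 1, but a mean probability well above what unrelated tokens receive across many prompts is reasonable evidence that the model’s output correlates with the heuristic.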