Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small
Some of Redwood’s current research involves finding specific behaviors that language models exhibit, and then doing interpretability to explain how the model does these behaviors. One example of this is the indirect object identification (IOI) behavior, investigated in a forthcoming paper of ours: given the input When John and Mary went to the store, Mary gave a flower to, the model completes John instead of Mary. Another example is the acronym generation task: given the input In a statement released by the Big Government Agency (, the model completes BGA).
We are considering scaling up this line of research a bunch, and that means we need a lot more behaviors to investigate! The ideal tasks that we are looking for have the following properties:
The task arises in a subset of the training distribution. Both the IOI and the acronym tasks are abundant in the training corpus of language models. This means that we are less interested in tasks specific to inputs that never appear in the training distribution.
The ideal task can even be expressed as a regular expression that can be run on the training corpus to obtain the exact subset, as is the case for acronyms. The IOI task is less ideal in this sense, since it is harder to identify the exact subset of the training distribution that involves IOI.
There is a simple heuristic for the task. For IOI, the heuristic is “fill in the name that appeared only once so far”. For acronyms, the heuristic is “string together the first letter of each capitalized word, and then close the parentheses”. Note that the heuristic does not have to apply to every instance of the task in the training distribution, e.g. sometimes an acronym is not formed by simply taking the first letter of each word.
The gold standard here is if the heuristic can be implemented in an automated way, e.g. as a Python function, but we would also consider the task if a human is needed to supply the labels.
GPT-2 small implements this heuristic. We are focusing on the smallest model in the GPT-2 family right now, which is a 117M parameter model that is frankly not that good at most things. We are moderately interested in tasks that bigger models can do that GPT-2 small can’t, but the bar is at GPT-2 small right now.
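To make the first two properties concrete, here is a minimal sketch of what "a regex for the subset" and "a Python function for the heuristic" could look like for the acronym and IOI examples. The regex and the names acronym_label and ioi_label are illustrative assumptions, not the code we actually use, and the acronym pattern only covers runs of simple capitalized words.

```python
import re

# Assumed pattern: two or more capitalized words followed by " (" at the
# end of the prompt. Illustrative only; real acronyms are messier.
ACRONYM_PROMPT = re.compile(r"((?:[A-Z][a-z]+ )+[A-Z][a-z]+) \($")


def acronym_label(prompt):
    """Return the heuristic completion for an acronym prompt, or None."""
    match = ACRONYM_PROMPT.search(prompt)
    if match is None:
        return None
    # String together the first letter of each capitalized word,
    # then close the parentheses.
    initials = "".join(word[0] for word in match.group(1).split())
    return initials + ")"


def ioi_label(prompt, names):
    """Return the candidate name that appears exactly once so far, or None."""
    once = [
        name
        for name in names
        if len(re.findall(rf"\b{re.escape(name)}\b", prompt)) == 1
    ]
    # Only answer when the heuristic picks out a single name.
    return once[0] if len(once) == 1 else None


print(acronym_label("In a statement released by the Big Government Agency ("))
print(ioi_label(
    "When John and Mary went to the store, Mary gave a flower to",
    ["John", "Mary"],
))
```

Note that ioi_label needs the candidate names supplied by the caller, which is one reason the IOI subset is harder to extract automatically than the acronym subset.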
The following is a list of tasks that we have found so far/are aware of. Induction and acronym generation remain the tasks that best meet all of the above desiderata.
Induction: If the model has seen … A B … in the input, then it is more likely to output B immediately after A when encountering A again. A common use case of induction is remembering complicated names: T|cha|ik|ovsky … T| → cha|ik|ovsky
Email address generation: Hi! My name is John Smith, and I work at Apple. My email address is → email@example.com
Subject-verb agreement: The keys to the cabinet → are and not is, even though the singular “cabinet” is also present in the prompt. (Credit to Nate Thomas)
Incrementing citation numbers: Most metals are excellent conductors . However, this is not always the case [ → 2] (Credit to Lucas Sato)
Variations of induction and IOI
Extended IOI: identify the right person out of a group with more than two names
Inferring someone’s last name from name of relative
<img alt="A man in a hat" src="man_in_hat.jpg"><img alt="A dog in my house" src=" → dog_in_house.jpg"> (Credit to Rob Miles)
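As a rough illustration of the induction heuristic from the list above, the sketch below copies whatever followed the most recent earlier occurrence of the current token. The function name induction_prediction and the token split are assumptions made for the example (the split mirrors the T|cha|ik|ovsky illustration, not GPT-2's real tokenizer output).

```python
def induction_prediction(tokens):
    """Predict the next token by copying what followed the last earlier
    occurrence of the final token. Returns None if there is no earlier one."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# Hypothetical tokenization of "Tchaikovsky wrote music. T"
context = ["T", "cha", "ik", "ovsky", " wrote", " music", ".", "T"]
print(induction_prediction(context))
```

Feeding each prediction back in reproduces the rest of the name one token at a time, which is the "remembering complicated names" use case.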
Some examples that we are less excited about include:
Closing doc strings, parentheses, etc
Close variations of known examples
We would love for interested people to contribute ideas! Below are some resources we put together to make the search as easy as possible:
You can play with GPT-2 small here in a web app that we built specifically for investigating model behaviors. Alternatively, GPT-2 is also available on the HuggingFace website, or you could use goose.ai to play with Fairseq-125M, which is approximately as good as GPT-2 small. If you find that you need more sophisticated tools to interact with the model, I would be potentially down to supply you with them. Neel Nanda also has a bounty out on Bountied Rationality for essentially the same thing; feel free to check out the resources he provides and/or also participate in the bounty.
It might be useful to keep in mind what kinds of texts GPT-2 small was trained on. This is the description of the training corpus according to HuggingFace: “The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weights 40GB of texts but has not been publicly released. You can find a list of the top 1,000 domains present in WebText here.” Here is a website that shows random snippets of OpenWebText (the open clone of the unreleased WebText) whenever you refresh it.
Some folks found that playing the language modeling game is helpful for coming up with these tasks.
What counts as the model implementing a heuristic successfully? In general, we just want the model’s output to correlate with the answer expected from the heuristic. This means that we don’t demand that the model always outputs the correct answer, or that the model puts the highest probability on the expected answer. We just care that the model puts high enough probability on the expected answer. With the tools we have, it might still be difficult to evaluate exactly how good the model is at a given task. When in doubt, feel free to post the task, just qualify it with your level of confidence in the model’s ability to do it.
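One way to read "high enough probability" is as a threshold on the softmax probability of the expected token, rather than a requirement that it be the top prediction. The sketch below assumes you already have next-token logits from somewhere; the function names and the 0.1 threshold are illustrative assumptions, not a cutoff we have committed to.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def puts_enough_probability(logits, expected_index, threshold=0.1):
    """True if the expected token gets at least `threshold` probability.

    The threshold is an assumed, arbitrary value for illustration.
    """
    return softmax(logits)[expected_index] >= threshold


# Toy three-token vocabulary: the expected token (index 1) is not the
# argmax, but it still clears the assumed threshold.
toy_logits = [2.0, 1.0, 0.0]
print(puts_enough_probability(toy_logits, expected_index=1))
```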
We encourage people to post the examples they find directly in the comments of this thread! Your examples might serve as inspiration for others. For convenient sharing, there is a “Share to URL” button on the model behavior search website that copies a URL that leads to the current state of the web app to clipboard.
Unit conversion, such as
“Fresno is 204 miles (329 km) northwest of Los Angeles and 162 miles (” → 261 km)
“Fresno is 204 miles (329 km) northwest of Los Angeles and has an average temperature of 64 F (” → 18 C)
“Fresno is 204 miles (” → 329 km)
Results: 1, 2, 3. It mostly gets the format right (but not the right numbers).
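For anyone who wants to score the numbers as well as the format, here is a minimal sketch of an automated labeler for prompts like these. The regex, the function name conversion_label, and the choice to round to the nearest integer are assumptions made for illustration; it only handles the miles and F cases shown above.

```python
import re

MILES_TO_KM = 1.609344

# Assumed prompt shape: "<number> miles (" or "<number> F (" at the very end.
PROMPT_END = re.compile(r"(\d+(?:\.\d+)?) (miles|F) \($")


def conversion_label(prompt):
    """Return the expected completion, e.g. '261 km)', or None."""
    match = PROMPT_END.search(prompt)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    if unit == "miles":
        return f"{round(value * MILES_TO_KM)} km)"
    # Fahrenheit to Celsius.
    return f"{round((value - 32) * 5 / 9)} C)"


print(conversion_label("northwest of Los Angeles and 162 miles ("))
print(conversion_label("has an average temperature of 64 F ("))
```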
This is an interesting one! It looks like there might be some very rough heuristics going on for the number part as well, e.g. the model knows the number in km is almost definitely 3 digits.
Reversing text w/ 1 example:
“Mike is large → large is Mike
Bob is cute → cute is”
Also works w/ numbers (but I had trouble getting it to reverse 3 digits at a time):
“3 6 → 6 3
2 88 →”
Ignoring a zero
“1 + 1 = 0 + 2
2 + 2 = 0 + 4
3 + 3 = 0 +”
This also worked when replacing 0 w/ “pig”, but changing it to “df” made it predict “ 5” as the answer; I think it just wants to count up from the previous answer, 4.
Parallel structure w/ Independent
For each of the following, the model predicts a “.” at the end.
I eat spaghetti, yet she eats pizza
I slept, for I was sleepy
I can eat, or I can sleep
I love my dog, and my dog loves me
I can neither eat, nor can I sleep
Some n-grams overpower this effect. In the above, “yet she eats ice” will be followed by “ cream”, and “I was sleepy, so I slept” will be followed by “ in”.
Three Items in a List
“She ate the cookies, cake” will be followed by “,” and then “ and”.
[Note: the language modelling game and the GPT-2 small search tool were very useful]
Thanks for contributing these! I’m not sure I understand the one about ignoring a zero: is the idea that it can not only do normal addition, but also addition in the format with a zero?
Completing Incomplete Quotations
Pattern: [“<incomplete quoted statement>,” <descriptor of speaker> said,] → [“<completion of sentence following from previous quotation><...>]
Example: [“When the truth is replaced by silence,” the Soviet dissenter said,] → [ “it will be impossible to hold securely everything.] (prediction starts with [ “] ~71% of the time)
The next token will be [ “] ~45-70% of the time when the original quotation is obviously incomplete.
When the original quotation looks more like a complete sentence, the next token will be [ “] only ~5-20% of the time (see counterexample below).
Counter Example (initial quotation is a complete statement; in this case removed ‘When’):
[“The truth is replaced by silence,” the Soviet dissenter said,] → [adding that the TV show was a farcical] (prediction starts with [ “] only ~12% of the time)
‘From’ - ‘To’ Numeric Symmetry
Pattern: [from <member of numeric class> to] → [ <different member of numeric class>]
[from 1874 to] → [ 1882]
[from March 34, 1999 to] → [ May 12, 2004]
[from 5:40 am to] → [ 8:00 am]
[from 30 degrees to] → [ 100 degrees]
[from 89 to] → [ 93]
[from 154 to] → [ 195]
[from 12539 to] → [ 13114]
[from 2,631,254,399 to] → [ 3,021,133,526]
Maintains symmetry between plausible years/dates/times/temperatures. For dates and times, the prediction is heavily biased towards a higher value after ‘to’ (as would be expected from the training corpus). Also maintains symmetry of number of digits in arbitrary numbers that don’t fall into an obvious class, though this starts losing exactness past 5 digits (but still remains roughly symmetric). Interestingly, exactness of number of digits for larger numbers improves substantially when commas are added to the number (e.g. 1,000,000).
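The digit-count symmetry can be checked automatically. Below is a minimal sketch that compares the "shape" of the number before ‘to’ with the shape of the completion; digit_shape and is_symmetric are hypothetical helper names, and the check deliberately ignores whether the second value is larger.

```python
import re


def digit_shape(text):
    """Replace every digit with 'd', keeping commas, colons, and units."""
    return re.sub(r"\d", "d", text.strip())


def is_symmetric(source, completion):
    """True if both strings have the same digit count and punctuation layout."""
    return digit_shape(source) == digit_shape(completion)


print(digit_shape("2,631,254,399"))
print(is_symmetric("12539", " 13114"))
print(is_symmetric("5:40 am", " 8:00 am"))
```

Because non-digit characters are kept as they are, this exact-match version would reject the date example above (the month names differ), so it is best suited to the plain-number, time, and temperature cases.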
Syntactically Correct HTTP URL Generation/Completion
Pattern: [https://] → [<syntactically valid and real-looking URL containing a domain, resource, sometimes query parameters, etc>]
[https://] → [www.parks.org/programs/]
[http://wowthisissocool] → [380.blogspot.com/2015/03/]
[https://ibetthiswillgetqueryparams.com] → [/submit?inc=false&type=Out]
Beyond being merely syntactically valid, common URL resource nesting patterns are observed, like the [/<year>/<month>/] pattern above, or [/<resource>/<id>].
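A minimal sketch of how the [/<year>/<month>/] nesting pattern could be detected in a generated URL, using Python's standard urllib.parse. The regex and the function name has_year_month_path are assumptions for illustration, and the check says nothing about whether the URL actually resolves.

```python
import re
from urllib.parse import urlparse

# Assumed shape of the blog-style path: /YYYY/MM/ at the start of the path.
YEAR_MONTH_PATH = re.compile(r"^/\d{4}/\d{2}/")


def has_year_month_path(url):
    """True if the URL is http(s), has a host, and starts with /YYYY/MM/."""
    parsed = urlparse(url)
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and YEAR_MONTH_PATH.match(parsed.path) is not None
    )


# Prompt plus completion from the second example above.
print(has_year_month_path("http://wowthisissocool380.blogspot.com/2015/03/"))
print(has_year_month_path("https://www.parks.org/programs/"))
```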
Thanks for these! I love the ‘from’ → ‘to’ one: it seems GPT-2 small clearly knows the rough ordering of numbers in various formats, although when I was playing with it and trying to get it to do addition in real life settings, it appears quite bad at actually knowing how numbers work.
“Either”, “or” pairs in text.
Heuristic. If the word either appears in a sentence, wait for the comma and then add an “ or”.
What follows are a few examples. Note that each completion is just something I came up with; the important part is the or. Using the web app, GPT-2 puts a high probability (around 40%-60%) on the token “ or”.
“Either you take a left at the next intersection,” → or take a left after that.
“Either you go to the cinema,” → or you stay at home.
“Tonight I could either order some food,” → or cook something myself.
“Do you rather want to go to Portugal or Italy? Either” → way is fine./one is fine. (GPT-2 puts a lot of probability on “ way”, and barely any on “ or”, which is correct).
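The heuristic above can be written down as a small labeling function. This is a minimal sketch under simple assumptions: sentences are split on ".", "?" and "!", and the name expects_or is made up for the example.

```python
import re


def expects_or(prompt):
    """True if the heuristic predicts ' or' as the next token."""
    # Keep only the text after the last sentence-ending punctuation mark.
    sentence = re.split(r"[.?!]", prompt)[-1]
    has_either = re.search(r"\beither\b", sentence, flags=re.IGNORECASE) is not None
    # "Wait for the comma": only fire when the prompt ends with one.
    return has_either and prompt.rstrip().endswith(",")


print(expects_or("Either you go to the cinema,"))
print(expects_or("Do you rather want to go to Portugal or Italy? Either"))
```

The second prompt does not end in a comma, so the function returns False, which matches the model putting barely any probability on " or" there.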
Thanks! There are probably other grammatical structures in English that require a bit of algorithmic thinking like this one as well.
I found some behaviors, but I’m not sure this is what you are looking for because the algorithm in both is quite simple. I’d appreciate feedback on them.
Incrementing days of the week
“If today is Monday, tomorrow is Tuesday. If today is Wednesday, tomorrow is” → “Thursday”
“If today is Monday, tomorrow is Tuesday. If today is Thursday, tomorrow is” → “Friday”
This also works with zero-shot prompting, although the effect isn’t as strong, e.g.:
“If today is Friday, tomorrow is” → “Saturday”
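The day-of-the-week behavior has an easy automated label. A minimal sketch, assuming the expected answer is always the day after the last day named in the prompt (next_day_label is a hypothetical name):

```python
import re

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def next_day_label(prompt):
    """Return the day following the last day mentioned in the prompt, or None."""
    words = re.findall(r"[A-Za-z]+", prompt)
    mentioned = [word for word in words if word in DAYS]
    if not mentioned:
        return None
    # Wrap around from Sunday back to Monday.
    return DAYS[(DAYS.index(mentioned[-1]) + 1) % len(DAYS)]


print(next_day_label(
    "If today is Monday, tomorrow is Tuesday. If today is Wednesday, tomorrow is"
))
print(next_day_label("If today is Friday, tomorrow is"))
```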
“Lisa is great. I really like” → “her”
“John is great. I really like” → “him”