Part 7 of 12 in the Engineer’s Interpretability Sequence.

Thanks to Neel Nanda. I used some very nicely-written code of his from here. And thanks to both Chris Olah and Neel Nanda for briefly discussing this challenge with me.

MI = “mechanistic interpretability”

Given a network, recover its labeling function’s program.

In the last post, I argued that existing works in MI focus on solving problems that are too easy. Here, I am posing a challenge for mechanists that is still a toy problem but one that is quite a bit less convenient than studying a simple model or circuit implementing a trivial, known task. To the best of my knowledge:

Unlike prior work on MI from the AI safety interpretability community, beating this challenge would be the first example of mechanistically explaining a network’s solution to a task that was not cherrypicked by the researcher(s) doing so.

Gaining a mechanistic understanding of the models in this challenge may be difficult, but it will probably be much less difficult than mechanistically interpreting highly intelligent systems in high stakes settings in the real world. So if an approach can’t solve the type of challenge posed here, it may not be very promising for doing much heavy lifting with AI safety work.

This post comes with a GitHub repository. Check it out here. The challenge is actually two challenges in one, and the basic idea is similar to some ideas presented in Lindner et al. (2023).

Challenge 1, MNIST CNN

I made up a nonlinear labeling function that labels approximately half of all MNIST images as 0’s and the other half as 1’s. Then I trained a small CNN on these labels, and it got 96% testing accuracy. The challenge is to use MI tools on the network to recover that labeling function.

Hint 1: The labels are binary.

Hint 2: The network gets 95.58% accuracy on the test set.

Hint 3: The labeling function can be described in words in one sentence.

Hint 4: This image may be helpful.

Challenge 2, Transformer

I made up a labeling function that takes in two integers from 0 to 113 and outputs either a 0 or 1. Then, using a lot of code from Neel Nanda’s grokking work, I trained a 1-layer transformer on half of the data. It then got 97% accuracy on the test half. As before, the challenge is to use MI tools to recover the labeling function.

Hint 1: The labels are binary.

Hint 2: The network is trained on 50% of examples and gets 97.27% accuracy on the test half.

Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary...

Prizes

If you are the first person to send me the labeling function’s program and a justified mechanistic explanation for either challenge (e.g. in the form of a colab notebook), I will sing your praises on my Twitter, and I would be happy to help you write a post about how you solved a problem I thought would be very difficult. Neel Nanda and I are also offering a cash prize. (Thanks to Neel for offering to contribute to the pool!) Neel will donate $250, and I will donate $500 to a high-impact charity of choice for the first person to solve each challenge. That makes the total donation prize pool $1,500.

Good luck

For this challenge, I intentionally designed the labeling functions to not be overly simple. But I will not be too surprised if someone reverse-engineers them with MI tools, and if so, I will be extremely interested in how.

Neither of the models perfectly label the validation set. One may object that this will make the problem unfairly difficult because if there is no convergence on the same behavior as the actual labeling function, then how is one supposed to find that function inside the model? This is kind of the point though. Real models that real engineers have to work with don’t tend to conveniently grok onto a simple, elegant, programmatic solution to a problem.

Here’s the GitHub link again.

Good luck!

Questions

Do you agree or disagree that if mechanistic interpretability tools are unable to solve problems like these that they are unlikely to be very useful for safety?
Are there reasons to expect that systems that are larger or trained to be more intrinsically interpretable may be substantially easier to interpret than small models like these?

EIS VII: A Challenge for Mechanists