Seeking Feedback on My Mechanistic Interpretability Research Agenda

Why this post

I’ve been doing MI research full-time for about three months. My current grant is ending soon and I recently received a Lightspeed rejection, so now seems like a good time to step away from object-level work and reflect on direction and next steps. I spent a few hours writing a draft, realized it was more complicated than it needed to be (for the sake of sounding complicated?), and rewrote what I really want to do and why in ~750 words. I intend to incorporate feedback into an OpenPhil early career application, which I will submit in a few days. I am also applying to AI research positions; if I receive one, this feedback will help inform what I focus on.

Agenda

In my mind, the goal of mechanistic interpretability research is to understand what state-of-the-art networks are doing well enough that we can examine whether any dangerous behaviors are occurring. I expect automated techniques to be an important step towards understanding such models (Examples: 1, 2, 3). However, I feel we are missing an important foundation: we have yet to fully understand what small language models are doing, and thus we don’t know the set of things that automated techniques should be detecting. (Should these techniques be searching for reasoning? Specific behaviors? Memorized information? Superposition?) I believe we first need to conduct a meticulous, low-level exploration of shallow language models to guide our analysis of larger models.

With the release of the TinyStories models, we now have small language models that produce reasonable text completions (of children’s stories). TinyStories-1Layer-21M, a one-layer model in this family, will serve as the focus of my research. I will conduct low-level mechanistic interpretability on this model using Anthropic’s Circuits Framework. I believe that circuit-style analysis of this network will be feasible given its relatively small size and the limited number of pathways through it.
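To make this concrete, below is a minimal sketch of the kind of setup I have in mind, assuming the model can be loaded through TransformerLens (the model alias, prompt, and hook name are illustrative assumptions, not finalized choices; the underlying checkpoint is the TinyStories-1Layer-21M release on HuggingFace):

```python
# Minimal sketch: load the 1-layer TinyStories model and cache its activations.
# The model alias and prompt are illustrative; adjust if the alias differs.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("tiny-stories-1L-21M")
model.eval()

prompt = "Once upon a time, Lily was looking for her lost ball. She looked under the"
logits, cache = model.run_with_cache(prompt)

# Post-nonlinearity MLP activations for the single layer: shape [batch, pos, d_mlp]
mlp_post = cache["post", 0]
print(mlp_post.shape)
```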

I strongly believe there will be rules and patterns in this 1-layer model that can serve as a guide to what we should search for in larger models. (Of course, there are almost certainly rules and patterns in larger models trained on more diverse data that do not exist in this model, but this should be a good starting point for building a mechanistic understanding of what language models are doing.) My hope is that, with focused low-level exploration of the weights and activations of a one-layer model, I can build a library of human-interpretable knowledge and functionality that this model appears to have learned.

I’ve started to explore this model, so I have some rough thoughts about components that might exist, though these claims should be understood as preliminary. I believe I have detected MLP neurons that relate to a particular concept. For example, I’ve found several neurons that seem to be strongly positive when characters in a story are “looking” or “searching” for something; these neurons’ contributions to the residual stream also generally support output tokens that have to do with “looking” or “searching”.
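As a rough illustration of how I have been finding candidate “concept” neurons (a sketch continuing from the setup above; the prompt and the top-k value are arbitrary placeholders, not part of any result):

```python
# Sketch: rank MLP neurons by their activation on the final token of a "searching" prompt.
search_prompt = "Tom was searching everywhere for his toy car. He was looking"
_, cache = model.run_with_cache(search_prompt)

last_token_acts = cache["post", 0][0, -1]        # shape: [d_mlp]
top_vals, top_neurons = last_token_acts.topk(10)
for n, v in zip(top_neurons.tolist(), top_vals.tolist()):
    print(f"neuron {n}: activation {v:.3f}")
```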

On the other hand, some neurons seem to support syntax rules. One neuron is strongly positive for pronouns; its contribution to the residual stream generally supports verb and adverb output tokens (I → am, They → look, they → see, They → both [feel]). Another is strongly positive for articles; its contribution to the residual stream generally supports adjectives and nouns (a → banana, the → trees, a → big [banana]). Is the model deliberately composing content and syntax rules to successfully write stories about “searching” using proper grammar? Are there additional types of rules being composed as well?
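One way to check the second half of these claims, i.e. which output tokens a neuron’s write to the residual stream supports, is to project that neuron’s output weights through the unembedding, a standard direct-logit-attribution-style calculation. In the sketch below, NEURON_IDX is a placeholder, not a specific neuron I am claiming exists:

```python
# Sketch: which vocabulary tokens does a given neuron's residual-stream write promote?
# Ignores the final LayerNorm scale; NEURON_IDX is a placeholder index.
NEURON_IDX = 123

with torch.no_grad():
    w_out = model.blocks[0].mlp.W_out[NEURON_IDX]   # [d_model]: this neuron's write direction
    logit_effect = w_out @ model.W_U                # [d_vocab]: effect on each output logit

top_vals, top_tokens = logit_effect.topk(15)
print(model.to_str_tokens(top_tokens))
```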

Once I have a library of knowledge and functionality, I will try to handcraft network weights that implement this behavior, similar to Chris Olah’s handcrafted curve detectors and my past work handcrafting weights for a 1-layer transformer. Handcrafting network weights previously allowed me to detect a piece of the network’s overall mechanism that I had missed. I expect handcrafting weights will likewise point out areas of my overall explanation of TinyStories-1Layer-21M that are incomplete or missing.
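As a sketch of what handcrafting could look like here (everything below is illustrative: the neuron index, the token lists, and the choice to reuse embedding/unembedding directions are assumptions, not the procedure I am committing to):

```python
# Sketch: overwrite one MLP neuron so it reads a "pronoun" direction from the
# residual stream and writes a "verb" direction back. All choices are illustrative.
NEURON_IDX = 0
pronoun_ids = model.to_tokens(" I They he she", prepend_bos=False)[0]
verb_ids = model.to_tokens(" am look see feel", prepend_bos=False)[0]

with torch.no_grad():
    read_dir = model.W_E[pronoun_ids].mean(0)      # average pronoun embedding
    write_dir = model.W_U[:, verb_ids].mean(-1)    # average verb unembedding direction
    model.blocks[0].mlp.W_in[:, NEURON_IDX] = read_dir / read_dir.norm()
    model.blocks[0].mlp.W_out[NEURON_IDX] = write_dir / write_dir.norm()

# Compare completions before and after the edit to see whether the handcrafted
# neuron reproduces (part of) the behavior attributed to the original one.
print(model.generate("They", max_new_tokens=5))
```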

Over the last three months, I’ve successfully interpreted three small transformer models (Stephen Casper’s MI transformer challenge, ARENA monthly problem #1, ARENA monthly problem #2). I am confident in my ability to use PyTorch to conduct circuit-style analysis, manipulating and combining network weights and activations to create annotated graphs that I can examine to generate and test hypotheses.

Understanding how 1-layer models work will serve as a baseline for understanding 2-layer models. With models of two or more layers we get attention head composition, which likely unlocks behaviors not available to 1-layer models. With a sense of what exists in 1-layer models, we can check which of those patterns carry over to 2-layer models and which are new. After exploring 2-layer models, we can hopefully continue to scale up, again using the smaller models as a guide to some of the things we might expect to find in larger ones. Ultimately, I hope this process allows us to better understand what large language models are doing so that we can better evaluate safety questions.

Conclusion

Thank you for taking the time to read this. Again, I am open to all feedback, including negative feedback.

If you are a funder or employer potentially interested in discussing and/or funding this work, please reach out over LessWrong DMs.

If you have thoughts about organizations that might fund this research, again, please feel free to DM or post in the comments.