The Strange Science of Interpretability: Recent Papers and a Reading List for the Philosophy of Interpretability
Link post
TL;DR: We recently released two papers about the Philosophy of (Mechanistic) Interpretability [here and here] and a reading list [here]. We believe that building a foundation for interpretability which leverages lessons from other disciplines (philosophy, neuroscience, social science) can help us understand AI models. We also believe this is a useful area for philosophers, neuroscientists and human-computer interaction (HCI) researchers to contribute to AI Safety. If you’re interested in this project (especially in contributing or collaborating), please reach out at koayon@gmail.com, apply to one of my SPAR projects, or message me on Slack if we share a channel.
Mechanistic Interpretability (MechInterp) is a field that aims to understand the internal mechanisms of neural networks. Though MechInterp is relatively new, the shape of many of its problems and solutions has been studied in other contexts. For example, characterising and intervening on neural representations has a rich literature in (the Philosophy of) Neuroscience; what makes causal-mechanistic explanations useful is a topic of ongoing conversation in the Philosophy of Science; and how humans learn from explanations is an empirical question in the Social Sciences.
The Strange Science is a series of papers on the Philosophy of Interpretability; it aims to adapt and develop theory from the Philosophies of Science, Neuroscience and Mind to address practical problems in Mechanistic Interpretability. We recently released the first two papers in the series.
The first paper, titled A Mathematical Philosophy of Explanations in Mechanistic Interpretability, has the following abstract:
Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI’s inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.
The second paper, titled Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability, has the following abstract:
Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question “What makes a good explanation?” We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives from the Philosophy of Science—the Bayesian, Kuhnian, Deutschian, and Nomological—to systematically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising approach. Fruitful research directions implied by our framework include (1) clearly defining explanatory simplicity, (2) focusing on unifying explanations and (3) deriving universal principles for neural networks. Improved MI methods enhance our ability to monitor, predict, and steer AI systems.
We also released a reading list for the Philosophy of Interpretability here, which is open-source and accepting contributions.
We would be excited to hear from interpretability researchers, ML researchers, philosophers, neuroscientists, human-computer interaction researchers and social scientists who are interested in this topic. If you’re interested in contributing or collaborating, please reach out at koayon@gmail.com, apply to one of my SPAR projects, or message me on Slack if we share a channel.
A huge thanks to Louis Jaburi, my co-conspirator on the first two papers! Also a massive thanks to everyone who read drafts of the papers, including: Nora Belrose, Matthew Farr, Sean Trott, Elsie Jang, Evžen Wybitul, Andy Artiti, Owen Parsons, Kristaps Kallaste and Egg Syntax. We appreciate Daniel Filan and Joseph Miller’s helpful feedback. Thanks to Mel Andrews, Alexander Gietelink Oldenziel, Jacob Pfau, Michael Pearce, Samuel Schindler, Catherine Fist, Lee Sharkey, Jason Gross, Joseph Bloom, Nick Shea, Barnaby Crook, Eleni Angelou, Dashiell Stander, Geoffrey Irving and attendees of the ICML 2024 MechInterp Social for useful conversations.