Why I’m Working On Model Agnostic Interpretability

Work done @ SERI-MATS.

This is the first in a short series of short posts about interpretability. In this post, I’m collecting some thoughts on why model agnostic interpretability is a worthwhile pursuit. I’ll assume that the reader is sympathetic to arguments for interpretability in general. If you’re not, maybe Neel can help.

Model agnostic interpretability methods are those which treat the model in question as a black box. They don’t require access to gradients or activations, and make no assumptions about the model’s architecture. The model inside could be a support vector machine; a deep neural network; a reinforcement learning agent; a set of water-filled pipes; or a human in a box with a set of instructions: any system that produces some output in response to some input. This is in contrast to model specific interpretability methods, which either require access to the internal state of the model, or make assumptions about its architecture.
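To make the black-box contract concrete, here is a minimal sketch of the only interface a model agnostic method gets to rely on. It’s illustrative only: the names `Input`, `Output`, and `BlackBoxModel` are my own, not from any particular library.

```python
from typing import Callable, Sequence

# The entire contract a model agnostic method can assume: a callable from
# inputs to outputs. Gradients, activations, and architecture are all hidden.
Input = Sequence[float]    # e.g. pixel values, token ids, tabular features
Output = Sequence[float]   # e.g. class logits or probabilities

BlackBoxModel = Callable[[Input], Output]

def query(model: BlackBoxModel, x: Input) -> Output:
    """The only operation available: feed an input in, read an output out."""
    return model(x)
```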

Model agnostic interpretability methods are entirely perturbation-based, meaning that they consist of various ways of changing the input and observing how the output changes (what else is there to do?). There are many ways to do this, and I will refer you to other excellent overviews rather than reiterating them here.

Here’s an example of perturbation-based saliency mapping, a model agnostic interpretability method. Parts of the input are iteratively perturbed, and the resulting changes in the logit for the class ‘dog’ are mapped to the location of those perturbations.
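A minimal sketch of that procedure, assuming a hypothetical black-box `model` callable that maps an (H, W, C) image array to a vector of class logits (no particular framework implied):

```python
import numpy as np

def occlusion_saliency(model, image, target_class, patch=8, baseline=0.0):
    """Slide an occluding patch over the image and record how much the
    target-class logit drops at each location.

    `model` is assumed to be a black-box callable mapping an (H, W, C)
    array to a 1-D array of class logits; this is a sketch, not the API
    of any specific library.
    """
    h, w = image.shape[:2]
    original_logit = model(image)[target_class]
    saliency = np.zeros((h, w))

    for top in range(0, h, patch):
        for left in range(0, w, patch):
            perturbed = image.copy()
            perturbed[top:top + patch, left:left + patch] = baseline  # occlude a patch
            drop = original_logit - model(perturbed)[target_class]
            saliency[top:top + patch, left:left + patch] = drop      # map the change back

    return saliency  # high values = regions the 'dog' logit depends on
```

Notice that nothing in the loop touches the model’s internals: the same code works whether `model` wraps a convolutional network, a random forest, or a human in a box with a set of instructions.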

Some of these methods (like perturbation-based saliency mapping) work with any kind of data. You could perform the same kind of iterative perturbation upon time-series, or text, or tabular inputs, or RL environments in a pretty straightforward manner. Other methods (like feature visualisation) rely on a searchable input space, which makes them harder to apply to arbitrary input types (although I suspect not impossible – more on that in an upcoming post).
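As a sketch of how little changes across input types, here is the same occlusion idea applied to text, again assuming a hypothetical black-box `model` that maps a list of token strings to class logits:

```python
def token_saliency(model, tokens, target_class, mask_token="[MASK]"):
    """Occlude one token at a time and record the drop in the target logit.

    `model` is assumed to map a list of token strings to class logits;
    the mask token is arbitrary: any neutral replacement would do.
    """
    original_logit = model(tokens)[target_class]
    scores = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [mask_token] + tokens[i + 1:]
        scores.append(original_logit - model(perturbed)[target_class])
    return list(zip(tokens, scores))  # per-token importance for the target class
```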

Model agnostic methods have some nice properties:

  • They allow comparisons across models with different architectures, with low engineering overhead.

  • They are able to capture gestalt, global phenomena in model behaviour, in a way that local, circuits-style interpretability cannot.

  • Most importantly, they are robust to paradigm shifts in model architecture.

This last property is the one I’m most interested in. What if the looming AGI that keeps us up at night is not GPT-X, but some other architecture that our current interpretability methods won’t transfer to? What if all the excellent people doing excellent interpretability work right now are building and learning things that will turn out to be irrelevant? Is this a legitimate concern?

Some questions:

1) How difficult is it to adapt model specific interpretability methods to arbitrary novel architectures? I plan to spend some research time on this in the near future. If it’s quite difficult, then working on model agnostic methods is important. My intuition is that adapting existing model specific interpretability methods is probably non-trivial, and that’s assuming the novel architecture is similar in kind: i.e. still a feed-forward neural network trained using gradient descent.

2) How likely are we to see a paradigm shift in model architectures (that leads to AGI) large enough to break existing interpretability methods? (And, how long will we have before such a shift results in dangerous AGI? Will we have time to develop model specific interpretability methods for the new paradigm?) If this is likely (or, given the stakes, just possible), then working on model agnostic methods is important. I’m quite uncertain about this, and I expect opinions to differ widely, probably strongly correlated with timelines.

It seems to me that there is a moderately strong case to be made for allocating resources to this kind of work, if the answer to question one is ‘non-trivial’ and the answer to question two is at least ‘somewhat likely’. I think these are reasonable answers (and, no-one else seems to be doing model agnostic interpretability research) – so here I am.