The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Work done @ SERI-MATS.

Evaluating interpretability methods (and so, developing good ones) is really hard because we have no ground truth. Or at least, no ground truth that we can compare our interpretations directly against.

The ground truth of a model’s behaviour is provided by that model’s architecture and its learned parameters. But puny humans are unable to interpret this: it’s precise, in that it fully determines the model’s behaviour, but it’s not interpretable. On the other end of the spectrum we have something like “This model classifies cats” – a statement that is really easy to interpret, but lacks something in the way of precision.

Precise <---------------------------------> Interpretable

^ Useful?

Imagine two interpretations, each generated by a different method with respect to the same model (say, a cat classifier). Method A indicates that the model has learned to use ears and whiskers to identify cats. Method B indicates that it uses eyes and tails. Assuming both are easy to interpret, can we tell which method is more precise? Which more faithfully represents what the model is truly doing?

If we had a method that reconciled precision and interpretability, how would we know?

Well, we can perform sanity checks on the interpretability methods, and throw away any that fail them. This seems good – it’s at least objective – but it only really allows us to throw away obviously bad approaches. It doesn’t say anything about what to do when sane interpretability methods disagree.

We could also look at the interpretations and see if they appear sensible to us. This is a widely used approach (Zeiler et al., Petsiuk et al., Fong et al., many many more), and I think it’s a terrible idea.

Example:

  • We’ve made some new interpretability method that is supposed to help us understand which words are used by a language model to identify hate speech in tweets. To see if it works properly, we compare the words highlighted by this interpretability method to the actual hateful words in the tweet. It gets them right! Our new interpretability method works!

  • NO! We have fallen prey to a terrible assumption: that if a model performs well, it has learned to use the same features that a human would use. How we would perform a task is not the ground truth. How the model actually performs a task is, but we don’t know that – it’s what we’re trying to find out!

Example:

  • We use gradient descent to optimise the input to a model such that it maximises the activation of a particular node, layer, or logit. This works, but results in a really noisy input that doesn’t make sense to us – it seems like adversarial noise. So, we regularise the input, perturb it intermittently during optimisation, constrain it to the training data distribution, and voila – we have a nice optimised input that makes sense! We found a fur detector!

  • NOOOOOO! We’ve optimised the input to maximise some output, sure. But we’ve also optimised it to maximise how much we like it. That’s not what we wanted! That has nothing to do with what the model has actually learned, and how sensible an interpretation seems to us has no relation to the ground truth.
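To see how the regulariser smuggles in our preferences, here’s a minimal numpy sketch. The “model” is just a single linear unit with arbitrary weights (a stand-in for a real network’s node or logit, not any particular architecture), and the “prior” is an L2 penalty standing in for the smoothness/plausibility constraints used in real feature visualisation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": one linear unit whose activation we want to maximise.
# W is arbitrary -- a stand-in for whatever the network actually learned.
W = rng.normal(size=16)

def activation(x):
    return W @ x

def grad_activation(x):
    return W  # gradient of W @ x with respect to x

# Plain gradient ascent on the input...
x_plain = rng.normal(size=16)
for _ in range(100):
    x_plain = x_plain + 0.1 * grad_activation(x_plain)

# ...versus "regularised" ascent that also pulls the input toward
# something we find palatable (here: small L2 norm, i.e. a prior
# that has nothing to do with the model's parameters).
x_reg = rng.normal(size=16)
for _ in range(100):
    x_reg = x_reg + 0.1 * (grad_activation(x_reg) - 0.5 * x_reg)

# Both inputs drive the activation up, but x_reg settles where the
# model's gradient balances against OUR prior -- so its final shape
# reflects our preferences as much as the model's computation.
```

The point is that the regularised optimum is a compromise between two objectives, only one of which belongs to the model.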

I’m being a bit dramatic. These kinds of approaches can be useful, and god knows I love a good feature visualisation, same as anyone. But I’m worried about using stuff like this to determine how good our interpretability methods are. It’s not an objective evaluation.

A small idea: what if we did have access to the ground truth? If we had a small, simple model that we completely understood (I’m looking at you, mechanistic interpretability people), we could use it as a truly objective benchmark for other interpretability methods. (This is super easy for model-agnostic saliency mapping: just use summation in place of the model, and then the ground truth saliency of each input element is exactly that element’s value. If your saliency map isn’t exactly the input, the method isn’t working perfectly – and moreover, you can see exactly where it’s failing.)
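The summation benchmark can be sketched in a few lines of numpy. As an illustration I plug in gradient × input (one common attribution rule; any model-agnostic saliency method could take its place) and compare its output against the known ground truth:

```python
import numpy as np

# Ground-truth "model": plain summation. Because the output is just the
# sum of the inputs, the true saliency of each element is that element's
# own value.
def model(x):
    return x.sum()

# Saliency method under evaluation: gradient x input. For summation the
# gradient is all-ones, so the attribution should recover x exactly.
def gradient_times_input(x):
    grad = np.ones_like(x)  # d(sum)/dx_i = 1 for every i
    return grad * x

x = np.array([0.5, -2.0, 3.1, 0.0])
saliency = gradient_times_input(x)

# Elementwise deviation from the ground truth: any nonzero entry shows
# exactly where (and by how much) the method fails.
error = np.abs(saliency - x)
```

A method that doesn’t pass even this trivial benchmark can be discarded objectively, with no appeal to whether its maps “look right” to us.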