This is basically three months' worth of Alignment Newsletters focused solely on interpretability, wrapped up into a single post. The authors provide summaries of 70 (!) papers on the topic, and include links to another 90. I’ll focus on their opinions about the field in this summary.
The theory and conceptual clarity of the field of interpretability have improved dramatically since its inception. There are several new or clearer concepts, such as simulatability, plausibility, (aligned) faithfulness, and (warranted) trust. This seems to have had a decent amount of influence over the more typical “methods” papers.
There have been lots of proposals for how to evaluate interpretability methods, leading to the [problem of too many standards](https://xkcd.com/927/). The authors speculate that this is because both “methods” and “evaluation” papers lack sufficient clarity about which research questions they are trying to answer. Even after choosing an evaluation methodology, it is often unclear which other techniques you should be comparing your new method to.
At a high level, there has been clear progress on specific methods for achieving interpretability. There are cases where we can:
1. identify concepts that certain neurons represent,
2. find feature subsets that account for most of a model’s output,
3. find changes to data points that yield requested model predictions,
4. find training data that influences individual test time predictions,
5. generate natural language explanations that are somewhat informative of model reasoning, and
6. create somewhat competitive models that are inherently more interpretable.
There does seem to be a problem of disconnected research and reinventing the wheel. In particular, work at CV conferences, work at NLP conferences, and work at NeurIPS / ICML / ICLR form three clusters that for the most part do not cite each other.
Planned opinion:
This post is great. Especially to the extent that you like summaries of papers (and according to the survey I recently ran, you probably do like summaries), I would recommend reading through this post. You could also read through the highlights from each section, bringing it down to 13 summaries instead of 70.
Hi Rohin!
Thanks for this summary of our post.
I think one other sub-field that has seen a lot of progress is in creating somewhat competitive models that are inherently more interpretable (e.g. a lot of the augmented/approximate decision tree models), as well as some of the decision set stuff.
Otherwise, I think it’s a fair assessment. I will also link this comment to Peter so he can chime in with any suggested clarifications of our opinions.
Cheers,
Owen
Sounds good, I’ve added a sixth bullet point. Fyi, I originally took that list of 5 bullet points verbatim from your post, so you might want to update that list in the post as well.