To add to this historical retrospective on interpretability methods: alternatively, we can use a parameter decomposition of a bottleneck (“exemplar”) layer over a model with non-identifiable parameters (e.g., an LLM) to make a semi-supervised connection to the observed data, conditional on the output prediction, re-casting the prediction as a function over the training set’s labels and representation space via a metric-learner approximation.

How do we know that the matched exemplars are actually relevant, or equivalently, that the approximation is faithful to the original model? One simple (but meaningful) metric is whether the prediction of the metric-learner approximation matches the class of the original model’s prediction; when they disagree, the discrepancies should be concentrated in low-probability regions. Remarkably, relatively simple functions over the representation space and the labels achieve this property. This line of work was introduced in the following paper, which appeared in the journal Computational Linguistics and was presented at EMNLP 2021: “Detecting Local Insights from Global Labels: Supervised & Zero-Shot Sequence Labeling via a Convolutional Decomposition” https://doi.org/10.1162/coli_a_00416.

When projection to a lower resolution than the available labels is desired for data analysis (e.g., from the document level to the word level for error detection), we then have a straightforward means of defining the inductive bias via the loss (e.g., encouraging one feature detection per document, or multi-label detections, as shown in subsequent work). In other words, we can decompose an LLM’s document-level prediction into word-level feature detections, and then map those detections to the labeled data in the support set. This works with models of arbitrary scale: the dense matching via the distilled representations from the filter applications of the exemplar layer requires computation on the order of commonly used dense retrieval mechanisms, and the other transforms add minimal overhead.
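To make the faithfulness check above concrete, here is a minimal sketch, not the paper’s convolutional decomposition: a simple kNN over exemplar-layer representations of the support set stands in for the metric-learner approximation, and we measure (1) how often its class matches the original model’s class and (2) whether the mismatches fall in the model’s low-probability region. All function names, shapes, and the confidence threshold are illustrative assumptions.

```python
# A minimal, illustrative sketch (not the referenced implementation) of the
# faithfulness metric: approximate a frozen classifier with a kNN over
# exemplar-layer representations of the support set, then check class
# agreement and where the disagreements sit in the model's probability mass.

import numpy as np


def knn_approximation(query_reps, support_reps, support_labels, k=5, n_classes=2):
    """Predict each query by a distance-weighted vote over its k nearest
    support exemplars in representation space."""
    preds = []
    for q in query_reps:
        dists = np.linalg.norm(support_reps - q, axis=1)
        nearest = np.argsort(dists)[:k]
        votes = np.zeros(n_classes)
        for idx in nearest:
            votes[support_labels[idx]] += 1.0 / (1.0 + dists[idx])
        preds.append(int(np.argmax(votes)))
    return np.array(preds)


def faithfulness_report(model_probs, approx_preds, low_prob_threshold=0.7):
    """Compare the approximation's classes to the original model's classes,
    and check whether disagreements concentrate below a confidence threshold."""
    model_preds = model_probs.argmax(axis=1)
    model_conf = model_probs.max(axis=1)

    agree = approx_preds == model_preds
    mismatch_conf = model_conf[~agree]

    return {
        "agreement_rate": float(agree.mean()),
        # If the approximation is faithful, most mismatches should sit in the
        # original model's low-probability (high-uncertainty) region.
        "mismatches_in_low_prob_region": (
            float((mismatch_conf < low_prob_threshold).mean())
            if len(mismatch_conf) else 1.0
        ),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for exemplar-layer representations and model output probabilities.
    support_reps = rng.normal(size=(200, 32))
    support_labels = rng.integers(0, 2, size=200)
    query_reps = rng.normal(size=(50, 32))
    model_probs = rng.dirichlet([2.0, 2.0], size=50)

    approx_preds = knn_approximation(query_reps, support_reps, support_labels)
    print(faithfulness_report(model_probs, approx_preds))
```

In practice the representations would come from the exemplar (bottleneck) layer of the trained model rather than random draws, and the matching function could be any simple function over the representation space and labels; the point of the check is the same regardless of which metric learner is used.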
The key advantage relative to alternative interpretability methods (across the paradigms presented here) is that it leads to methods with which we can close the loop on the connection between the data, the representation space, the features, the predictions, and the predictive uncertainty. See my work from 2022 onward for further details (e.g., https://arxiv.org/abs/2502.20167), including incorporating uncertainty-aware verification and interpretability-by-exemplar as intrinsic properties of the LLMs themselves (i.e., for both train- and test-time search/compute/generation), rather than as a post-hoc approximation.