Website: https://allenschmaltz.github.io/
A2z
I would argue that, in fact, we do have a “high reliability path to safeguards for superintelligence”, predicated on controlling the predictive uncertainty, constrained by the representation space of the models. The following post provides a high-level overview: https://www.lesswrong.com/posts/YxzxzCrdinTzu7dEf/the-determinants-of-controllable-agi-1
Once we control for the uncertainty over the output, conditional on the instructions, other extant interpretability methods can (in principle) then be used as semi-supervised learning methods to further examine the data and predictions.
Aside: It would potentially be an interesting project for a grad student or researcher (or team thereof) to revisit the existing SAE and RepE lines of work, constrained to the high-probability (and low-variance) regions determined by an SDM estimator. Controlling for the epistemic uncertainty is important for knowing whether the inductive biases of the interpretability methods (SAE, RepE, and related), established on held-out dev sets, will remain applicable to new, unseen test data.
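As a rough illustration of the proposed constraint, here is a minimal sketch in Python. Everything in it is a placeholder: `sdm_probability` stands in for the admission signal a trained SDM estimator would provide over the model's representations, and the threshold `alpha` is arbitrary; the real construction is described in the linked post and associated papers. The only point is that the SAE/RepE analysis is fit on the admitted subset, with the coverage reported explicitly rather than left implicit.

```python
# Minimal sketch: restrict an interpretability analysis to the high-probability,
# low-variance region of a dev set. All arrays below are random placeholders;
# in practice, `hidden_states` comes from the frozen model and `sdm_probability`
# from an SDM-style estimator over those representations.
import numpy as np

rng = np.random.default_rng(0)

n_dev = 1000
hidden_states = rng.normal(size=(n_dev, 768))        # stand-in LLM representations
sdm_probability = rng.uniform(0.5, 1.0, size=n_dev)  # stand-in admission probabilities
alpha = 0.95                                         # hypothetical admission threshold

admitted = sdm_probability >= alpha
admitted_states = hidden_states[admitted]

print(f"coverage: {admitted.mean():.1%} of dev instances admitted")
# ...fit the SAE / find RepE directions on `admitted_states` only, and report
# any resulting claims separately inside vs. outside the admitted region...
```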
I will set aside the question of resource allocation for others to decide, and just note that there is another branch of interpretability research that can (at least in principle) be used in conjunction with the other approaches and that addresses a fundamental limitation of them: namely, work whose focus is deriving robust estimators of the predictive uncertainty, conditional on controlling for the representation space of the models over the available observed data. The following post provides a high-level overview: https://www.lesswrong.com/posts/YxzxzCrdinTzu7dEf/the-determinants-of-controllable-agi-1
The reason this is a unifying method is that once we control for the uncertainty, we have non-vacuous checks that the inductive biases of the semi-supervised methods (SAE, RepE, and related), established on held-out dev sets, will remain applicable to new, unseen test data.
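As a toy illustration of what such a non-vacuous check could look like (not the actual SDM construction), the following sketch fits a probe on a dev split, evaluates it on an unseen test split, and stratifies the result by a high-confidence admission rule. Synthetic data and the probe's own confidence are crude stand-ins for the real representations and estimator.

```python
# Toy illustration of a stratified transfer check: fit a probe on dev, evaluate
# it on unseen test data, and ask whether its errors concentrate outside a
# high-confidence region. Synthetic data and the probe's own confidence are
# crude stand-ins; an SDM estimator would additionally account for
# similarity/distance to the training support.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_split(n):
    """Toy 'representations' with label noise near the decision boundary."""
    x = rng.normal(size=(n, 16))
    y = (x[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return x, y

x_dev, y_dev = make_split(2000)
x_test, y_test = make_split(2000)   # new, unseen test split

probe = LogisticRegression(max_iter=1000).fit(x_dev, y_dev)

# Placeholder admission rule: keep only test instances assigned high probability.
conf_test = probe.predict_proba(x_test).max(axis=1)
admitted = conf_test >= 0.95

print(f"test accuracy, all instances:   {probe.score(x_test, y_test):.3f}")
print(f"test accuracy, admitted region: "
      f"{probe.score(x_test[admitted], y_test[admitted]):.3f} "
      f"(coverage {admitted.mean():.1%})")
```

On this toy data the errors concentrate outside the admitted region; the substantive question is whether an SDM estimator provides the analogous guarantee for the representations actually used by SAE/RepE analyses.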
I never understood the SAE literature, which came after my earlier work (2019–2020) on sparse inductive biases for feature detection (i.e., semi-supervised decomposition of feature contributions) and on interpretability-by-exemplar via model approximations (over the representation space of models), which I originally developed with the goal of bringing deep learning to medicine. Since the parameters of large neural networks are non-identifiable, the mechanisms for interpretability must shift from understanding individual parameter values to semi-supervised matching against comparable instances and, most importantly, to robust and reliable predictive uncertainty over the output, for which we now have effective approaches: https://www.lesswrong.com/posts/YxzxzCrdinTzu7dEf/the-determinants-of-controllable-agi-1
(That said, the usual caveat obviously applies: people should feel free to study whatever they are interested in, as you can never predict what side effects, learning, and new results, including in other areas, might occur.)
To add to this historical retrospective on interpretability methods: Alternatively, we can use a parameter decomposition of a bottleneck (“exemplar”) layer over a model with non-identifiable parameters (e.g., LLMs) to make a semi-supervised connection to the observed data, conditional on the output prediction, recasting the prediction as a function over the training set’s labels and representation space via a metric-learner approximation.

How do we know that the matched exemplars are actually relevant, or equivalently, that the approximation is faithful to the original model? One simple (but meaningful) metric is whether the prediction of the metric-learner approximation matches the class of the original model’s prediction; when they do not match, the discrepancies should be concentrated in low-probability regions. Remarkably, relatively simple functions over the representation space and labels achieve that property. This line of work was introduced in the following paper, which appeared in the journal Computational Linguistics and was presented at EMNLP 2021: “Detecting Local Insights from Global Labels: Supervised & Zero-Shot Sequence Labeling via a Convolutional Decomposition” https://doi.org/10.1162/coli_a_00416.

When projection to a lower resolution than the available labels is desired for data analysis (e.g., from the document level to the word level for error detection), we then have a straightforward means of defining the inductive bias via the loss (e.g., encouraging one feature detection per document, or multi-label detections, as shown in subsequent work). In other words, we can decompose an LLM’s document-level prediction into word-level feature detections, and then map to the labeled data in the support set. This works with models of arbitrary scale: the dense matching via the distilled representations from the filter applications of the exemplar layer requires computation on the order of commonly used dense retrieval mechanisms, and the other transforms require minimal additional overhead.
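For concreteness, here is a minimal sketch of the faithfulness check described above, with a plain k-nearest-neighbors classifier standing in for the paper’s weighted function over matched exemplars, and random arrays standing in for the distilled exemplar-layer representations and the original model’s document-level predictions. With random placeholders the printed numbers are meaningless; the form of the check is the point.

```python
# Minimal sketch of the faithfulness check for a metric-learner approximation.
# All arrays are random placeholders; in practice they come from the trained
# model's exemplar layer and its document-level predictions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-ins: exemplar-layer representations for the labeled training support
# set and for new evaluation documents, plus the original model's outputs.
train_reps = rng.normal(size=(5000, 128))
train_labels = rng.integers(0, 2, size=5000)
eval_reps = rng.normal(size=(500, 128))
model_probs = rng.uniform(0.5, 1.0, size=500)   # original model's confidence
model_pred = rng.integers(0, 2, size=500)       # original model's predicted class

# Metric-learner approximation: predict by matching against labeled exemplars
# in the representation space (a plain KNN here; the paper uses a function over
# the matched exemplars' labels and distances).
approx = KNeighborsClassifier(n_neighbors=8, metric="euclidean")
approx.fit(train_reps, train_labels)
approx_pred = approx.predict(eval_reps)

# Faithfulness check: the approximation should agree with the original model,
# and any disagreements should be concentrated in low-probability regions.
agree = approx_pred == model_pred
print(f"class agreement with original model:  {agree.mean():.1%}")
print(f"mean model probability | agreement:   {model_probs[agree].mean():.3f}")
print(f"mean model probability | disagreement: {model_probs[~agree].mean():.3f}")
```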
The key advantage relative to alternative interpretability methods (across the paradigms presented here) is that it leads to methods with which we can close the loop on the connection between the data, the representation space, [the features,] the predictions, and the predictive uncertainty. See my work from 2022 and more recently for further details (e.g., https://arxiv.org/abs/2502.20167), including incorporating uncertainty-aware verification and interpretability-by-exemplar as intrinsic properties of the LLMs themselves (i.e., for both train- and test-time search/compute/generation), rather than as post-hoc approximations.