Once we control for the uncertainty over the output, conditional on the instructions, other extant interpretability methods can then, in principle, be used as semi-supervised learning methods to further examine the data and predictions.
Aside: It would potentially be an interesting project for a grad student or researcher (or team thereof) to revisit the existing SAE and RepE lines of work, constrained to the high-probability (and low-variance) regions determined by an SDM estimator. Controlling for the epistemic uncertainty is important for knowing whether the inductive biases of the interpretability methods (SAE, RepE, and related), established on the held-out dev sets, will carry over to new, unseen test data.
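To make the aside concrete, here is a minimal sketch of what "constrained to the high-probability (and low-variance) regions" could look like in practice. Everything here is an illustrative assumption rather than an existing API: the variable names, the thresholds, and the toy tied-weight sparse autoencoder step standing in for a full SAE (or RepE direction-finding) pipeline.

```python
# Sketch: restrict an interpretability analysis to the region admitted by
# an SDM-style uncertainty estimator, then analyze only that subset.
# All names and thresholds below are hypothetical stand-ins.

import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for quantities produced upstream:
#   hidden_states: final-layer activations for N dev-set examples
#   sdm_prob:      SDM-calibrated probability of the predicted output
#   sdm_variance:  dispersion of that estimate (hypothetical quantity)
N, d = 1000, 64
hidden_states = rng.normal(size=(N, d))
sdm_prob = rng.uniform(0.5, 1.0, size=N)
sdm_variance = rng.uniform(0.0, 0.2, size=N)

# Step 1: keep only the high-probability, low-variance region.
PROB_THRESHOLD = 0.95   # illustrative; in practice set via the estimator's calibration
VAR_THRESHOLD = 0.05
admitted = (sdm_prob >= PROB_THRESHOLD) & (sdm_variance <= VAR_THRESHOLD)
constrained_states = hidden_states[admitted]

# Step 2: run the interpretability method only on that region.
# A single forward pass of a toy tied-weight sparse autoencoder stands in
# for a full SAE training loop.
k = 32                                  # number of dictionary features
W = rng.normal(scale=0.1, size=(d, k))  # tied encoder/decoder weights
codes = np.maximum(constrained_states @ W, 0.0)   # ReLU feature codes
recon = codes @ W.T
l1_penalty = 1e-3
loss = np.mean((recon - constrained_states) ** 2) + l1_penalty * np.abs(codes).mean()

print(f"admitted {admitted.sum()}/{N} examples; toy SAE loss {loss:.4f}")
```

The point of the filter in Step 1 is that whatever features or directions the downstream method recovers are then only claimed to hold over the region where the estimator's calibration is trusted, rather than over the full dev-set distribution.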
I would argue that we do, in fact, have a “high reliability path to safeguards for superintelligence”, predicated on controlling the predictive uncertainty, constrained by the representation space of the models. The following post provides a high-level overview: https://www.lesswrong.com/posts/YxzxzCrdinTzu7dEf/the-determinants-of-controllable-agi-1