I will set aside the question of resource allocation for others to decide, and just note that there is another branch of interpretability research that can (at least in principle) be used in conjunction with the other approaches, and that addresses a fundamental limitation of them: namely, work whose focus is deriving robust estimators of predictive uncertainty, conditioned on controlling for the models' representation space over the available observed data. The following post provides a high-level overview: https://www.lesswrong.com/posts/YxzxzCrdinTzu7dEf/the-determinants-of-controllable-agi-1
The reason this is a unifying method is that once we control for the uncertainty, we have a non-vacuous check that the inductive biases of the semi-supervised methods (SAE, RepE, and related), established on held-out dev sets, will remain applicable to new, unseen test data.
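To make the idea concrete, here is a minimal sketch (not the specific method from the linked post): it assumes fixed-width activation vectors extracted from a frozen model, and uses a kNN non-conformity score over a held-out dev set, calibrated to a quantile threshold, to flag test inputs whose representations fall outside the dev-set support, i.e., regions where probes or steering directions fit on the dev set may not transfer. All names and parameters here are illustrative.

```python
# Sketch: representation-conditioned uncertainty via a calibrated kNN score.
# Assumption: `dev_reps` holds activation vectors from a held-out dev set.
import numpy as np

def knn_nonconformity(dev_reps: np.ndarray, x: np.ndarray, k: int = 10) -> float:
    """Mean Euclidean distance from x to its k nearest dev-set representations."""
    dists = np.linalg.norm(dev_reps - x, axis=1)
    return float(np.sort(dists)[:k].mean())

def calibrate_threshold(dev_reps: np.ndarray, alpha: float = 0.05, k: int = 10) -> float:
    """Leave-one-out scores on the dev set; return the (1 - alpha) quantile."""
    scores = [
        knn_nonconformity(np.delete(dev_reps, i, axis=0), dev_reps[i], k)
        for i in range(len(dev_reps))
    ]
    return float(np.quantile(scores, 1.0 - alpha))

def in_calibrated_support(dev_reps: np.ndarray, x: np.ndarray,
                          threshold: float, k: int = 10) -> bool:
    """True if x lies within the dev-set support at the calibrated threshold."""
    return knn_nonconformity(dev_reps, x, k) <= threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dev = rng.normal(size=(500, 64))        # stand-in for dev-set activations
    tau = calibrate_threshold(dev, alpha=0.05)
    x_in = rng.normal(size=64)              # same distribution as the dev set
    x_out = rng.normal(loc=8.0, size=64)    # clearly shifted representation
    print(in_calibrated_support(dev, x_in, tau))   # typically True
    print(in_calibrated_support(dev, x_out, tau))  # typically False
```

The calibrated threshold is what plays the role of "controlling for the representation space" here: test points outside the calibrated support are treated as high-uncertainty, rather than having the dev-set inductive biases trusted by default.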