Stephen McAleese comments on Prospects for Alignment Automation: Interpretability Case Study

Stephen McAleese 10 May 2025 12:31 UTC
2 points
0
A key point is that some aspects of interpretability would be easier to automate than others. For example, it may be possible to automate the evaluation of faithfulness, that is, whether the explanations provided by an interpretability technique accurately capture the underlying computation in the model.
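
To make the faithfulness idea concrete, here is a minimal sketch of one way it could be scored automatically. It assumes the explanation can be run as a simple predictive stand-in for the model, and it defines faithfulness as the fraction of inputs on which the two agree. The names `faithfulness_score`, `toy_model`, and `toy_explanation` are hypothetical and only for illustration; real interpretability work would likely need a richer notion of agreement than exact output matching.

```python
from typing import Callable, Sequence, Tuple

Input = Tuple[float, float]


def faithfulness_score(
    model: Callable[[Input], int],
    explanation: Callable[[Input], int],
    inputs: Sequence[Input],
) -> float:
    """Fraction of inputs where the explanation predicts the same output as the model."""
    if not inputs:
        raise ValueError("inputs must be non-empty")
    matches = sum(1 for x in inputs if model(x) == explanation(x))
    return matches / len(inputs)


# Toy stand-in for a model: classifies by the sign of a weighted sum of two features.
def toy_model(x: Input) -> int:
    return int(2.0 * x[0] - 1.0 * x[1] > 0)


# Hypothetical explanation under test: "the model only looks at the first feature".
def toy_explanation(x: Input) -> int:
    return int(x[0] > 0)


inputs = [(1.0, 0.0), (1.0, 3.0), (-1.0, 0.0), (0.5, 0.5)]
score = faithfulness_score(toy_model, toy_explanation, inputs)
print(f"faithfulness: {score:.2f}")
```

The explanation here disagrees with the toy model on the input where the second feature dominates, which is exactly the kind of gap an automated faithfulness check could surface without a human in the loop.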

An important challenge here is that interpretability can be subjective, so we need to ensure that improved performance on a given metric actually correlates with increased comprehensibility for the humans trying to understand the model.