johnswentworth comments on How Do Selection Theorems Relate To Interpretability?

johnswentworth 10 Jun 2022 1:00 UTC
5 points
You’ve correctly identified most of the problems already. One missing piece: node activations are not necessarily the right thing to look at. Even in existing work, interpretable information is embedded in other ways, e.g. as directions in the activation space of a group of neurons, or as rank-one updates to matrices.
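
To make those two alternatives concrete, here is a rough NumPy sketch. Everything in it is made up for illustration (the layer size, the random activations, and the names `direction`, `u`, `v` are assumptions, not taken from any particular interpretability paper). It only shows what "a direction in activation space" and "a rank-one update to a matrix" mean mechanically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer: 8 neurons, activations recorded on 100 inputs.
activations = rng.normal(size=(100, 8))

# (1) A feature stored as a direction in activation space.
# No single neuron carries the feature; its value is the projection
# of the whole activation vector onto a unit direction.
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)
feature_readout = activations @ direction  # one scalar per input

# (2) Information stored as a rank-one update to a weight matrix.
# Adding outer(u, v) changes the layer's output only along u,
# by an amount proportional to how strongly the input aligns with v.
W = rng.normal(size=(8, 8))
u = rng.normal(size=8)
v = rng.normal(size=8)
W_edited = W + np.outer(u, v)

x = rng.normal(size=8)
delta = W_edited @ x - W @ x  # equals u * (v . x) up to rounding
```

In both cases, inspecting individual node activations one at a time would miss the structure: the readout in (1) mixes all eight neurons, and the change in (2) is spread across every entry of `W`.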