You’ve correctly identified most of the problems already. One missing piece: node activations are not necessarily the right thing to look at. Even in existing work, interpretable information is embedded in other ways, e.g. as directions in the activation space of a group of neurons, or as rank-one updates to matrices.
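To make the two alternatives concrete, here is a rough NumPy sketch on toy data. Everything in it is an assumption for illustration: the sizes, the random "feature", the names `direction`, `W`, `u`, and `v`, and the reading of "matrices" as a layer's weight matrix. It is not taken from any particular paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_neurons, n_out = 200, 8, 4   # toy sizes, chosen arbitrarily

# (1) A feature stored as a direction across several neurons.
direction = rng.normal(size=n_neurons)
direction /= np.linalg.norm(direction)    # hypothetical unit feature direction

feature = rng.normal(size=n_samples)      # hypothetical scalar feature per sample
noise = rng.normal(size=(n_samples, n_neurons))
# Remove the noise component along `direction` so the toy readout is exact.
noise -= np.outer(noise @ direction, direction)
activations = np.outer(feature, direction) + noise

# Projecting onto the direction recovers the feature.
readout = activations @ direction

# Each individual neuron only sees the feature mixed with noise, so its
# correlation with the feature is typically noticeably weaker.
best_single = max(
    abs(np.corrcoef(activations[:, i], feature)[0, 1]) for i in range(n_neurons)
)

# (2) A rank-one update to a matrix: W_edited = W + u v^T.
W = rng.normal(size=(n_out, n_neurons))   # stand-in for a weight matrix
u = rng.normal(size=n_out)                # what gets added to the output
v = direction                             # which input direction triggers it
delta = np.outer(u, v)
W_edited = W + delta

x_orth = noise[0]      # an input with no component along v
x_along = direction    # an input pointing exactly along v

# Inputs orthogonal to v pass through unchanged; inputs along v get u added.
unchanged = W_edited @ x_orth - W @ x_orth
shifted = W_edited @ x_along - W @ x_along
```

In both cases the interpretable object is a vector (or a pair of vectors) spanning many neurons, so inspecting neurons one at a time could easily miss it.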