I agree that issue 3 seems like a potential problem for methods that optimise too hard for sparsity, but it doesn’t seem that directly related to the main thesis? At least in the example you give, it should be possible in principle to notice that the space can be factored as a direct sum without having to look at future layers.
Sure, it’s possible in principle to notice that there is a subspace that can be represented factored into a direct sum. But how do you tell whether you in fact ought to represent it in that way, rather than as composed features, to match the features of the model? Just because the compositional structure is present in the activations doesn’t mean the model cares about it.
I don’t think your post contains any knockdown arguments that this approach is doomed (do you agree?), but it is perhaps suggestive.
I agree that it is not a knockdown argument. That is why the title isn’t “Activation space interpretability is doomed.”