Thank you. Yes, our claim isn’t that SAEs only find composed features. Simple counterexample: take the product of two spaces, each with 9 dictionary elements and an average of 3 features active at a time. Then the dictionary of 81 composed features has an L0 of 9, whereas the dictionary of 18 factored features has an L0 of 6, so a well-tuned SAE will learn the factored set of features. Note, however, that just because the dictionary of 18 factored features is sparser doesn’t mean those are the features the model uses. The model could be using the 81 composed features instead, because that is somehow more convenient for the downstream computations, or for some other reason.
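A minimal numerical sketch of that counterexample, assuming (for concreteness) that each factor feature fires independently with probability 1/3, so each factor has an average of 3 active features; the specific sampling scheme is just an illustrative choice, not something fixed by the argument above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_feats = 9        # dictionary elements per factor space
p = 1 / 3          # each factor feature active with prob 1/3 -> average of 3 active
n_samples = 100_000

active_a = rng.random((n_samples, n_feats)) < p   # active features in factor space A
active_b = rng.random((n_samples, n_feats)) < p   # active features in factor space B

# Factored dictionary (18 elements): one unit per factor feature.
l0_factored = active_a.sum(1) + active_b.sum(1)

# Composed dictionary (81 elements): one unit per (A-feature, B-feature) pair,
# active iff both members of the pair are active.
l0_composed = active_a.sum(1) * active_b.sum(1)

print(f"factored L0 ≈ {l0_factored.mean():.2f}")   # ≈ 6
print(f"composed L0 ≈ {l0_composed.mean():.2f}")   # ≈ 9
```

The factored dictionary wins on L0, which is exactly why a well-tuned SAE would prefer it; but nothing in this comparison tells us which representation the model itself is using.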
Our claim is that an SAE trained on the activations at a single layer cannot tell whether the model’s features are in the composed or the factored representation, because the representation the model actually uses need not be the one with the lowest L0.
Makes sense and totally agree, thanks!