Excellent work, and I think you raise a lot of really good points, which help clarify for me why this research agenda is running into issues and which, I think, tie in to my concerns about activation-space work engendered by recent successes in latent obfuscation (https://arxiv.org/abs/2412.09565v1).
In a way that does not affect the larger point, I think your framing of the problem of extracting composed features may be slightly too strong: in a subset of cases, e.g. when there is a hierarchical relationship between features (https://www.lesswrong.com/posts/XHpta8X85TzugNNn2/broken-latents-studying-saes-and-feature-co-occurrence-in), SAEs might be able to pull out groups of latents that act compositionally (https://www.lesswrong.com/posts/WNoqEivcCSg8gJe5h/compositionality-and-ambiguity-latent-co-occurrence-and). The relationship to any underlying compositional encoding in the model is unclear, this probably only works in a few cases, and it generally does not seem like a scalable approach, but I think SAEs may be doing something more complex/weirder than only finding composed features.
Thank you. Yes, our claim isn't that SAEs only find composed features. Simple counterexample: take a product space of two factor spaces with 9 dictionary elements each, with an average of 3 features active at a time in each factor space. Then the dictionary of 81 composed features has an L0 of 9 (3 × 3 active pairs), whereas the dictionary of 18 factored features has an L0 of 6 (3 + 3), so a well-tuned SAE will learn the factored set of features. Note, however, that just because the dictionary of 18 factored features is sparser doesn't mean that those are the features of the model. The model could be using the 81 composed features instead, because that's more convenient for the downstream computations somehow, or for some other reason.
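To make the L0 arithmetic concrete, here is a minimal sketch of the counting. It assumes, beyond what's stated above, that the number of active features in each factor space is roughly Poisson with mean 3 and that the two factor spaces are independent; a composed feature (one of the 9 × 9 = 81 pairs) fires exactly when both of its constituent factor features do.

```python
# Sketch of the L0 counting in the counterexample above.
# Assumptions (not from the original setup): per-factor activity is Poisson(3),
# capped at 9, and the two factor spaces are independent.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000
n_per_factor, mean_active = 9, 3

# Number of active features in each factor space per sample.
k_a = np.minimum(rng.poisson(mean_active, n_samples), n_per_factor)
k_b = np.minimum(rng.poisson(mean_active, n_samples), n_per_factor)

l0_factored = k_a + k_b   # dictionary of 9 + 9 = 18 factored features
l0_composed = k_a * k_b   # dictionary of 9 * 9 = 81 composed features

print(f"mean L0, factored dictionary (18 elements): {l0_factored.mean():.2f}")
print(f"mean L0, composed dictionary (81 elements): {l0_composed.mean():.2f}")
```

Running this prints an average L0 of roughly 6 for the factored dictionary and roughly 9 for the composed one, matching the counts above.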
Our claim is that an SAE trained on the activations at a single layer cannot tell whether the model's features are in a composed or a factored representation, because the representation the model uses need not be the one with the lowest L0.
Makes sense and totally agree, thanks!