Excellent work, and I think you raise a lot of really good points, which help clarify for me why this research agenda is running into issues and which, I think, tie in to my concerns about activation-space work engendered by recent successes in latent obfuscation (https://arxiv.org/abs/2412.09565v1).
In a way that does not affect the larger point, I think your framing of the problem of extracting composed features may be slightly too strong: in a subset of cases, e.g. when there is a hierarchical relationship between features (https://www.lesswrong.com/posts/XHpta8X85TzugNNn2/broken-latents-studying-saes-and-feature-co-occurrence-in), SAEs might be able to pull out groups of latents that act compositionally (https://www.lesswrong.com/posts/WNoqEivcCSg8gJe5h/compositionality-and-ambiguity-latent-co-occurrence-and). The relationship to any underlying compositional encoding in the model is unclear, this probably only works in a few cases, and it generally does not seem like a scalable approach, but I think SAEs may be doing something more complex/weirder than only finding composed features.
Thank you. Yes, our claim isn't that SAEs only find composed features. Simple counterexample: take a product space of two factor spaces with 9 dictionary elements each, with an average of 3 features active at a time in each factor space. Then the dictionary of 81 composed features has an L0 of 9 (3 × 3 active pairs), whereas the dictionary of 18 factored features has an L0 of 6 (3 + 3), so a well-tuned SAE will learn the factored set of features. Note, however, that just because the dictionary of 18 factored features is sparser doesn't mean that those are the features of the model. The model could be using the 81 composed features instead, because that's more convenient for the downstream computations somehow, or for some other reason.
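To make the L0 arithmetic concrete, here is a minimal sketch of the counting. It assumes, beyond what's stated above, that the number of active features in each factor space is roughly Poisson with mean 3 and that the two factor spaces are independent; a composed feature (one of the 9 × 9 = 81 pairs) fires exactly when both of its constituent factor features do.

```python
# Sketch of the L0 counting in the counterexample above.
# Assumptions (not from the original setup): per-factor activity is Poisson(3),
# capped at 9, and the two factor spaces are independent.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000
n_per_factor, mean_active = 9, 3

# Number of active features in each factor space per sample.
k_a = np.minimum(rng.poisson(mean_active, n_samples), n_per_factor)
k_b = np.minimum(rng.poisson(mean_active, n_samples), n_per_factor)

l0_factored = k_a + k_b   # dictionary of 9 + 9 = 18 factored features
l0_composed = k_a * k_b   # dictionary of 9 * 9 = 81 composed features

print(f"mean L0, factored dictionary (18 elements): {l0_factored.mean():.2f}")
print(f"mean L0, composed dictionary (81 elements): {l0_composed.mean():.2f}")
```

Running this prints an average L0 of roughly 6 for the factored dictionary and roughly 9 for the composed one, matching the counts above.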
Our claim is that an SAE trained on the activations at a single layer cannot tell whether the model's features are in a composed or a factored representation, because the representation the model uses need not be the one with the lowest L0.
Makes sense and totally agree, thanks!