Thank you. Yes, our claim isn’t that SAEs only find composed features. Simple counterexample: take the product of two spaces, each with 9 dictionary elements and an average of 3 features active at a time. Then the dictionary of 81 composed features has an L0 of 9, whereas the dictionary of 18 factored features has an L0 of 6, so a well-tuned SAE will learn the factored set of features. Note, however, that just because the dictionary of 18 factored features is sparser doesn’t mean those are the features the model uses. The model could be using the 81 composed features instead, because that is somehow more convenient for the downstream computations, or for some other reason.
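A minimal numerical sketch of that counterexample, assuming (for concreteness) that each factor feature fires independently with probability 1/3, so each factor has an average of 3 active features; the specific sampling scheme is just an illustrative choice, not something fixed by the argument above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_feats = 9        # dictionary elements per factor space
p = 1 / 3          # each factor feature active with prob 1/3 -> average of 3 active
n_samples = 100_000

active_a = rng.random((n_samples, n_feats)) < p   # active features in factor space A
active_b = rng.random((n_samples, n_feats)) < p   # active features in factor space B

# Factored dictionary (18 elements): one unit per factor feature.
l0_factored = active_a.sum(1) + active_b.sum(1)

# Composed dictionary (81 elements): one unit per (A-feature, B-feature) pair,
# active iff both members of the pair are active.
l0_composed = active_a.sum(1) * active_b.sum(1)

print(f"factored L0 ≈ {l0_factored.mean():.2f}")   # ≈ 6
print(f"composed L0 ≈ {l0_composed.mean():.2f}")   # ≈ 9
```

The factored dictionary wins on L0, which is exactly why a well-tuned SAE would prefer it; but nothing in this comparison tells us which representation the model itself is using.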
Our claim is that an SAE trained on the activations at a single layer cannot tell whether the model’s features are in the composed or the factored representation, because the representation the model actually uses need not be the one with the lowest L0.
Makes sense and totally agree, thanks!