I did some tests on random features for interpretability and found them to be interpretable. However, one would need to do a detailed comparison with SAEs trained with an L1 penalty to properly understand whether this loss function impacts interpretability. For what it’s worth, the distribution of feature sparsities suggests that we should expect reasonably interpretable features.
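(By "feature sparsity" here I mean the fraction of inputs on which a feature fires. A minimal sketch of that measurement, assuming an SAE with an encode function mapping model activations to feature activations; the interface and names below are placeholders for illustration, not the exact code used:)

```python
import torch

def feature_firing_rates(sae_encode, activations, batch_size=4096):
    """Fraction of inputs on which each SAE feature is active (> 0).

    sae_encode: callable mapping (batch, d_model) activations to
        (batch, n_features) feature activations -- a placeholder interface.
    activations: (n_samples, d_model) tensor of model activations.
    """
    counts = None
    n = activations.shape[0]
    with torch.no_grad():
        for start in range(0, n, batch_size):
            feats = sae_encode(activations[start:start + batch_size])
            fired = (feats > 0).float().sum(dim=0)
            counts = fired if counts is None else counts + fired
    return counts / n

# A histogram of log10(firing rate) gives the sparsity distribution referred to above.
```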
This is really cool!
One cheap and lazy approach is to see how many of your features have high cosine similarity with the features of an existing L1-trained SAE (e.g. “900 of the 2048 features detected by the L0approx-trained model had cosine sim > 0.9 with one of the 2048 features detected by the L1-trained model”). I’d also be interested to see individual examinations of some of the features which consistently appear across multiple training runs in the L0approx-trained model but don’t appear in an L1-trained SAE on the training dataset.
Thanks!
I looked at the cosine sims between the L1-trained reference model and one of my SAEs presented above and found:
- 2501 out of 24576 (10%) of the features detected by the L0approx-trained model had cosine sim > 0.9 with one of the 24576 features detected by the L1-trained model.
- 7774 out of 24576 (32%) had cosine sim > 0.8.
- 50% had cosine sim > 0.686.
I’m not sure how to interpret these. Are they low or high? They appear to be roughly similar to what I get if I compare two of the L0approx-trained SAEs against each other.
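For concreteness, the numbers above are max-over-dictionary cosine sims between decoder directions. A rough sketch of how that can be computed, assuming each SAE exposes a decoder matrix `W_dec` of shape (n_features, d_model); the variable names are placeholders rather than my exact code:

```python
import torch
import torch.nn.functional as F

def max_decoder_cosine_sims(W_dec_a, W_dec_b, chunk_size=2048):
    """For each decoder direction in SAE A, the highest cosine similarity
    with any decoder direction in SAE B.

    W_dec_a, W_dec_b: (n_features, d_model) decoder weight matrices
    (a placeholder for however the SAEs store their feature directions).
    """
    a = F.normalize(W_dec_a, dim=-1)
    b = F.normalize(W_dec_b, dim=-1)
    best = []
    # Chunk the (n_features_a, n_features_b) similarity matrix to keep memory manageable.
    for chunk in a.split(chunk_size):
        best.append((chunk @ b.T).max(dim=1).values)
    return torch.cat(best)

max_sims = max_decoder_cosine_sims(l0approx_sae.W_dec, l1_sae.W_dec)  # placeholder SAE objects
print((max_sims > 0.9).float().mean().item())  # fraction of features with a > 0.9 match
print((max_sims > 0.8).float().mean().item())  # fraction with a > 0.8 match
print(max_sims.median().item())                # median best-match cosine sim
```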
I think I’ll look more at individual examinations of features that consistently appear across multiple L0approx training runs but don’t appear in the L1-trained SAE. Some summarised examples are shown in the response above.
The other baseline would be to compare one L1-trained SAE against another L1-trained SAE. If you see a similar approximate “1/10 have cossim > 0.9, 1/3 have cossim > 0.8, 1/2 have cossim > 0.7” pattern, that’s not definitive proof that both approaches find “the same kind of features”, but it would strongly suggest that, at least to me.
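That baseline is just the same threshold summary computed on a second pair of dictionaries. A minimal sketch, reusing the hypothetical `max_decoder_cosine_sims` helper sketched above; the SAE variable names are again placeholders:

```python
def cossim_threshold_summary(max_sims, thresholds=(0.9, 0.8, 0.7)):
    """Fraction of features whose best cross-SAE match clears each threshold."""
    return {t: (max_sims > t).float().mean().item() for t in thresholds}

# L0approx vs L1, and L1 (seed 1) vs L1 (seed 2), using placeholder SAE objects.
l0_vs_l1 = cossim_threshold_summary(max_decoder_cosine_sims(l0approx_sae.W_dec, l1_sae.W_dec))
l1_vs_l1 = cossim_threshold_summary(max_decoder_cosine_sims(l1_sae_run1.W_dec, l1_sae_run2.W_dec))
print(l0_vs_l1)
print(l1_vs_l1)  # a similar pattern across both would support "same kind of features"
```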