One cheap and lazy approach is to see how many of your features have high cosine similarity with the features of an existing L1-trained SAE (e.g. “900 of the 2048 features detected by the L0approx-trained model had cosine sim > 0.9 with one of the 2048 features detected by the L1-trained model”).
I looked at the cosine sims between the L1-trained reference model and one of my SAEs presented above and found:
2501 out of 24576 (10%) of the features detected by the L0approx-trained model had cosine sim > 0.9 with one of the 24576 features detected by the L1-trained model.
7774 out of 24576 (32%) had cosine sim > 0.8
50% have cosine sim > 0.686
I’m not sure how to interpret these. Are they low or high? They look roughly similar to what I get when I compare two of the L0approx-trained SAEs against each other.
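For anyone who wants to run the same kind of comparison, here is a minimal sketch of how numbers like the ones above could be computed. It is not the code actually used here. It assumes each SAE's features are the rows of a decoder matrix of shape `(n_features, d_model)` with no all-zero rows; the names `max_cosine_sims`, `summarize`, `dec_l0` and `dec_l1` are made up for illustration, and the toy matrices are random stand-ins rather than real SAE weights.

```python
import numpy as np

def max_cosine_sims(dec_a: np.ndarray, dec_b: np.ndarray, chunk: int = 4096) -> np.ndarray:
    """For each row (feature) of dec_a, return its highest cosine similarity
    with any row of dec_b. Assumes no row is all zeros."""
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    best = np.empty(a.shape[0], dtype=a.dtype)
    # Work in chunks so the full (n_a x n_b) similarity matrix is never held at once.
    for start in range(0, a.shape[0], chunk):
        sims = a[start:start + chunk] @ b.T
        best[start:start + chunk] = sims.max(axis=1)
    return best

def summarize(best: np.ndarray, thresholds=(0.9, 0.8)) -> dict:
    """Fraction of features above each threshold, plus the median best match."""
    summary = {f"frac>{t}": float((best > t).mean()) for t in thresholds}
    summary["median"] = float(np.median(best))
    return summary

# Toy stand-in data: half of the "L0approx" features are exact copies of
# "L1" features, the other half are unrelated random directions.
rng = np.random.default_rng(0)
dec_l1 = rng.normal(size=(64, 16))
dec_l0 = dec_l1.copy()
dec_l0[32:] = rng.normal(size=(32, 16))

best = max_cosine_sims(dec_l0, dec_l1)
print(summarize(best))
```

The median of `best` corresponds to the “50% have cosine sim > x” figure. Running the same function on two SAEs trained with the same objective gives a baseline to compare against.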
I’d also be interested to see individual examinations of some of the features which consistently appear across multiple training runs in the L0approx-trained model but don’t appear in an L1-trained SAE on the training dataset.
I think I’ll look more at this. Some summarised examples are shown in the response above.
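One possible way to shortlist candidates for that kind of examination is sketched below. It uses decoder cosine similarity as a proxy for “appears” or “doesn’t appear”, which is an assumption on my part and not the same thing as checking activations on the training dataset. The function name `consistent_but_unmatched` and the thresholds `present=0.9` and `absent=0.5` are hypothetical choices, and the example matrices are toy one-hot directions, not real SAE weights.

```python
import numpy as np

def unit_rows(m: np.ndarray) -> np.ndarray:
    """Normalise each row to unit length (assumes no all-zero rows)."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def best_match(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Highest cosine similarity of each row of a with any row of b."""
    return (unit_rows(a) @ unit_rows(b).T).max(axis=1)

def consistent_but_unmatched(l0_runs, dec_l1, present=0.9, absent=0.5):
    """Indices of features in the first L0approx run that have a close match
    (> present) in every other L0approx run but no decent match (< absent)
    in the L1-trained SAE. Both thresholds are illustrative."""
    ref, others = l0_runs[0], l0_runs[1:]
    consistent = np.ones(ref.shape[0], dtype=bool)
    for other in others:
        consistent &= best_match(ref, other) > present
    missing = best_match(ref, dec_l1) < absent
    return np.flatnonzero(consistent & missing)

# Toy example with one-hot directions in an 8-dimensional space.
basis = np.eye(8)
run0 = basis[[0, 1, 2, 3]]      # first L0approx run
run1 = basis[[3, 2, 1, 0]]      # second run: same features, different order
dec_l1 = basis[[0, 1, 4, 5]]    # L1 SAE shares only the first two features

candidates = consistent_but_unmatched([run0, run1], dec_l1)
print(candidates)  # indices into run0
```

The returned indices are only a shortlist; each one would still need to be inspected by hand (for example via its top-activating examples) before concluding anything about what it represents.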
The other baseline would be to compare one L1-trained SAE against another L1-trained SAE. If you see a similar approximate “1/10 have cossim > 0.9, 1/3 have cossim > 0.8, 1/2 have cossim > 0.7” pattern, that’s not definitive proof that both approaches find “the same kind of features”, but it would strongly suggest that, at least to me.
Thanks!