Eoin Farrell comments on Experiments with an alternative method to promote sparsity in sparse autoencoders

Eoin Farrell 18 Apr 2024 6:01 UTC
8 points
Thanks!
One cheap and lazy approach is to see how many of your features have high cosine similarity with the features of an existing L1-trained SAE (e.g. “900 of the 2048 features detected by the $L^0_{\text{approx}}$-trained model had cosine sim > 0.9 with one of the 2048 features detected by the L1-trained model”).
I looked at the cosine sims between the L1-trained reference model and one of my SAEs presented above and found:
- 2501 out of 24576 (10%) of the features detected by the $L^0_{\text{approx}}$-trained model had cosine sim > 0.9 with one of the 24576 features detected by the L1-trained model.
- 7774 out of 24576 (32%) had cosine sim > 0.8
- 50% have cosine sim > 0.686
I’m not sure how to interpret these. Are they low or high? They appear roughly similar to what I get when I compare two of the $L^0_{\text{approx}}$-trained SAEs with each other.
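For anyone who wants to run the same kind of check, here is a minimal sketch of how numbers like these could be computed. It assumes each SAE's feature directions are available as a NumPy array of shape `(n_features, d_model)` with one feature per row; the function names `max_cosine_sims` and `summarize` and the toy data are hypothetical, not taken from the post's code. For each feature in the first SAE it takes the maximum cosine similarity with any feature in the second, then counts how many exceed each threshold and reports the median (the "50% have cosine sim > x" number).

```python
import numpy as np

def max_cosine_sims(features_a, features_b, chunk_size=4096):
    """For each row of features_a, return its highest cosine similarity
    with any row of features_b. Rows are assumed to be feature directions."""
    # Normalise rows to unit length so a dot product equals cosine similarity.
    a = features_a / np.linalg.norm(features_a, axis=1, keepdims=True)
    b = features_b / np.linalg.norm(features_b, axis=1, keepdims=True)
    out = np.empty(a.shape[0])
    # Work in chunks so the full (n_a x n_b) similarity matrix is never held at once.
    for start in range(0, a.shape[0], chunk_size):
        block = a[start:start + chunk_size] @ b.T
        out[start:start + chunk_size] = block.max(axis=1)
    return out

def summarize(max_sims, thresholds=(0.9, 0.8)):
    """Count features whose best match exceeds each threshold, plus the median."""
    counts = {t: int((max_sims > t).sum()) for t in thresholds}
    return counts, float(np.median(max_sims))

# Toy stand-ins for two decoder matrices (random data, for illustration only).
rng = np.random.default_rng(0)
toy_a = rng.normal(size=(64, 16))
toy_b = rng.normal(size=(64, 16))
toy_counts, toy_median = summarize(max_cosine_sims(toy_a, toy_b))
print(toy_counts, toy_median)
```

Running the same function on two SAEs trained with the same loss gives a baseline to compare the cross-method numbers against.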
I’d also be interested to see individual examinations of some of the features which consistently appear across multiple training runs in the $L^0_{\text{approx}}$-trained model but don’t appear in an L1-trained SAE on the training dataset.
I think I’ll look more at this. Some summarised examples are shown in the response above.
- faul_sname 18 Apr 2024 7:33 UTC
  2 points
  The other baseline would be to compare one L1-trained SAE against another L1-trained SAE—if you see a similar approximate “1/10 have cossim > 0.9, ¹⁄₃ have cossim > 0.8, ¹⁄₂ have cossim > 0.7” pattern, that’s not definitive proof that both approaches find “the same kind of features” but it would strongly suggest that, at least to me.