This plot illustrates how the choice of training and evaluation datasets affects reconstruction quality. Specifically, for every pairing of training and evaluation dataset it shows: 1) explained variance of the hidden states, 2) L2 reconstruction loss, and 3) the downstream CE difference in the language model.
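For concreteness, here is a minimal sketch of how these three metrics could be computed for a single (SAE, evaluation dataset) pair. It assumes a generic PyTorch SAE whose forward pass returns reconstructions of the hidden states; the exact reductions (e.g. summing vs. averaging over the hidden dimension) are my assumptions rather than the definitions used for the plot.

```python
import torch
import torch.nn.functional as F

def reconstruction_metrics(sae, hidden, logits_clean, logits_patched, tokens):
    """Per-batch metrics for one (SAE, eval dataset) pair.

    sae            -- trained sparse autoencoder; sae(hidden) returns reconstructions
                      (hypothetical interface)
    hidden         -- activations at the hooked layer, shape [batch, seq, d_model]
    logits_clean   -- LM logits with the original activations
    logits_patched -- LM logits with `hidden` replaced by sae(hidden)
    tokens         -- input token ids, shape [batch, seq]
    """
    recon = sae(hidden)

    # L2 reconstruction loss: squared error summed over the hidden dimension,
    # averaged over tokens (one possible convention).
    l2_loss = (recon - hidden).pow(2).sum(-1).mean()

    # Explained variance: 1 - residual variance / total variance,
    # using the mean activation as the baseline.
    resid_var = (hidden - recon).pow(2).sum()
    total_var = (hidden - hidden.mean(dim=(0, 1), keepdim=True)).pow(2).sum()
    explained_var = 1.0 - resid_var / total_var

    # Downstream CE difference: next-token cross-entropy with the SAE
    # reconstruction patched in, minus the clean cross-entropy.
    def next_token_ce(logits):
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1),
        )

    ce_diff = next_token_ce(logits_patched) - next_token_ce(logits_clean)
    return explained_var.item(), l2_loss.item(), ce_diff.item()
```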
The results indicate that SAEs generalize reasonably well across datasets, with a few notable points:
SAEs trained on TinyStories struggle to reconstruct other datasets, likely due to its synthetic nature.
Web-based datasets (the top-left 3×3 subset) transfer well to one another, although the CE difference and L2 loss are still 2–3 times higher than when evaluating on the training dataset itself. This matches expectations, but it suggests there may be room for methods that improve generalization beyond simply training a separate SAE on each dataset. This is particularly intriguing, given that my team is currently exploring dataset-related effects in SAE training.
In conclusion, the explained variance approaching 1 indicates that, even without direct feature matching, the composition of learned features remains consistent across datasets, as hypothesized. (The code is available in the same repository; results were evaluated on 10k sequences per dataset.)
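For reference, the full grid behind the plot amounts to a double loop over training and evaluation datasets, averaging the metrics above over the 10k evaluation sequences. The loader and averaging helpers below are hypothetical placeholders (only TinyStories among the dataset names appears in this post):

```python
# Hypothetical sketch of the cross-dataset evaluation grid.
datasets = ["web_dataset_a", "web_dataset_b", "web_dataset_c", "tinystories"]  # placeholder names

results = {}
for train_ds in datasets:
    sae = load_sae_trained_on(train_ds)              # hypothetical: SAE trained on train_ds
    for eval_ds in datasets:
        seqs = sample_sequences(eval_ds, n=10_000)   # hypothetical: 10k eval sequences per dataset
        # Average the per-batch metrics (explained variance, L2 loss, CE diff)
        # from the sketch above over all evaluation sequences.
        results[(train_ds, eval_ds)] = average_reconstruction_metrics(sae, seqs)
```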