Interesting! Glad to see our method being utilized in future research.
Do you have any metrics (e.g., Explained Variance or CE loss difference) on how SAEs trained on a specific dataset perform when applied to others? I suspect that if there is a small gap between the explained variance on the training dataset and other datasets, we might infer that, even though there’s no one-to-one correspondence between features learned across datasets, the combination of features retains a degree of similarity.
Additionally, it would be intriguing to investigate whether features across datasets become more aligned as training steps increase. I suspect a clear correlation between the number of training steps and the percentage of matched features, up to a saturation point.
Thank you for your comment!
Regarding the cross-dataset metric, it would indeed be interesting to test how an SAE trained on one dataset transfers to the others, and I'll share the comparison in the comments once I've run the measurements. If the combination of features does retain a degree of similarity, contrary to my subset hypothesis above, it might be because there are many different feature sets (i.e., bases of the feature space) that span the activations equally well, which could also explain why the feature-matching rate is generally lower than the ideal value of one.
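For concreteness, here is a minimal sketch of the kind of comparison I have in mind, assuming a hypothetical `sae.reconstruct` method and a `get_activations` helper that returns [n_tokens, d_model] hidden states for a dataset:

```python
import torch

def explained_variance(sae, acts: torch.Tensor) -> float:
    """Fraction of activation variance captured by the SAE reconstruction.

    `acts` has shape [n_tokens, d_model]; `sae.reconstruct` is a
    hypothetical method returning the reconstructed activations.
    """
    with torch.no_grad():
        recon = sae.reconstruct(acts)
        residual_var = (acts - recon).var(dim=0).sum()
        total_var = acts.var(dim=0).sum()
    return float(1.0 - residual_var / total_var)

# A small gap between the training dataset and the others would support
# the "similar feature composition" reading:
# ev_train = explained_variance(sae, get_activations("train_dataset"))
# ev_other = explained_variance(sae, get_activations("other_dataset"))
```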
I also looked at how features change over training: the matching ratio between the 1e8-token and 4e8-token checkpoints was only about 0.7 (even though the loss barely changed over that interval), which indicates a considerable impact. However, since my budget was insufficient to train to convergence across the various scenarios, I was unable to include this test in this work. One open question is whether the model eventually converges to a specific feature set or keeps oscillating because of the continuous data stream. This certainly seems like an interesting direction for further research.
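To make the matching ratio concrete, here is a rough sketch (not the exact code I used) that matches decoder directions between two checkpoints by cosine similarity and counts the fraction above a threshold; `W_dec_a` and `W_dec_b` are assumed to be the [n_features, d_model] decoder matrices of the two SAEs:

```python
import torch

def matching_ratio(W_dec_a: torch.Tensor,
                   W_dec_b: torch.Tensor,
                   threshold: float = 0.9) -> float:
    """Fraction of features in SAE A whose decoder direction has a close
    counterpart in SAE B (cosine similarity above an illustrative threshold).

    Both inputs are assumed to have shape [n_features, d_model].
    """
    a = torch.nn.functional.normalize(W_dec_a, dim=-1)
    b = torch.nn.functional.normalize(W_dec_b, dim=-1)
    cos = a @ b.T                    # pairwise cosine similarities
    best = cos.max(dim=-1).values    # best match in B for each feature of A
    return float((best > threshold).float().mean())

# Hypothetical usage with the two checkpoints mentioned above:
# ratio = matching_ratio(sae_1e8.W_dec, sae_4e8.W_dec)
```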
This plot illustrates how the choice of training and evaluation datasets affects reconstruction quality. Specifically, for each pair of training and evaluation datasets it shows: 1) the explained variance of the hidden states, 2) the L2 reconstruction loss, and 3) the downstream CE difference of the language model.
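For reference, the downstream CE difference is the increase in the language model's cross-entropy loss when the hidden states at the hooked layer are replaced by their SAE reconstructions. Below is a hedged sketch of one way to measure it with a PyTorch forward hook on a HuggingFace-style model; `model`, `layer_module`, `batch`, and `sae.reconstruct` are placeholders for the actual objects:

```python
import torch

def ce_with_optional_patch(model, layer_module, batch, sae=None) -> float:
    """Average next-token CE loss, optionally splicing SAE reconstructions
    into the hidden states at `layer_module` via a forward hook."""
    handle = None
    if sae is not None:
        def patch(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            recon = sae.reconstruct(hidden)  # hypothetical SAE interface
            if isinstance(output, tuple):
                return (recon,) + output[1:]
            return recon
        handle = layer_module.register_forward_hook(patch)
    try:
        with torch.no_grad():
            out = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
        return float(out.loss)
    finally:
        if handle is not None:
            handle.remove()

# CE difference = patched loss minus clean loss:
# ce_diff = ce_with_optional_patch(model, layer, batch, sae) \
#         - ce_with_optional_patch(model, layer, batch)
```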
The results indicate that SAEs generalize reasonably well across datasets, with a few notable points:
SAEs trained on TinyStories struggle to reconstruct other datasets, likely due to its synthetic nature.
SAEs trained on the web-based datasets (the top-left 3x3 subset) transfer well to one another, although the CE difference and L2 loss are still 2–3 times higher than when evaluating on the training dataset itself. This matches expectations, but it also suggests there may be ways to improve generalizability beyond training a separate SAE per dataset. This is particularly interesting given that my team is currently exploring dataset-related effects in SAE training.
In conclusion, the explained variance approaching 1 indicates that, even without a direct one-to-one feature matching, the composition of learned features remains consistent across datasets, as hypothesized.
(The code is available in the same repository; results were evaluated on 10k sequences per dataset.)