I would be careful about training SAEs from scratch on CE loss, since this can just move the superposition into the co-occurrences of correlated features.
For example, w/ top-k = 10, we could have 2 features that consistently co-occur and jointly encode more than 2 meanings:
[feature1 activation, feature2 activation]
[10, 0] = dog
[0, 10] = cat
[10, 10] = bird
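A toy numpy sketch of that table (the decoder directions d1/d2 and the 4-dim residual space are made up for illustration): two features yield three distinct reconstructions, so a downstream readout (and hence CE loss) can distinguish all three, with the "bird" meaning living in the co-occurrence rather than in either feature alone.

```python
import numpy as np

# Hypothetical decoder directions for the two SAE features (toy 4-dim space).
d1 = np.array([1.0, 0.0, 0.0, 0.0])  # feature1 direction
d2 = np.array([0.0, 1.0, 0.0, 0.0])  # feature2 direction

# Three distinct concepts reconstructed from only two features:
dog = 10 * d1            # activations [10, 0]
cat = 10 * d2            # activations [0, 10]
bird = 10 * d1 + 10 * d2  # activations [10, 10]

# All three reconstructions are distinct vectors, even though "bird" has no
# feature of its own -- its meaning is smuggled into the co-occurrence.
recons = {tuple(v) for v in (dog, cat, bird)}
print(len(recons))  # 3
```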
One way you can work around this is to switch to a fixed reconstruction target (as in normal SAE training).
You can always push CE loss lower and lower by shoving more meanings into specific co-occurrences of features, BUT if you train till [CE = 2.4] along with sparsity losses, it could work! Though at that point, you could've just trained a bunch of transcoders (maybe? could be a bit different).
KL-divergence against a larger model's outputs, distillation-style, might be the best fixed target to train against.
Hopefully that made sense!


Yep! It even talked a bit in my style of text-to-voice.