I would be careful about training SAEs from scratch on CE loss, since this will just move the superposition into combinations of correlated features.
For example, w/ top-k = 10, we could have 2 features that consistently co-occur and between them encode more than 2 meanings:
[feature1 activation, feature2 activation]
[10, 0] = dog
[0, 10] = cat
[10, 10] = bird
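To make the toy example concrete, here is a minimal sketch that treats the three patterns above as a lookup table and asks what each feature "means" when looked at on its own. The `codebook` dict and the helper name are made up for illustration, not taken from any SAE library.

```python
# Toy codebook copied from the example above:
# (feature1 activation, feature2 activation) -> meaning.
codebook = {
    (10, 0): "dog",
    (0, 10): "cat",
    (10, 10): "bird",
}

def meanings_when_active(codebook, index):
    """Return every meaning observed while feature `index` is active."""
    return {
        meaning
        for pattern, meaning in codebook.items()
        if pattern[index] > 0
    }

feature1_meanings = meanings_when_active(codebook, 0)
feature2_meanings = meanings_when_active(codebook, 1)

print("feature1 alone:", sorted(feature1_meanings))
print("feature2 alone:", sorted(feature2_meanings))
print("joint patterns:", len(codebook))
```

Each feature on its own is ambiguous between two meanings, while the pair carries three, which is the sense in which the superposition has moved into the co-occurrence pattern.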
One way you can work around this is to switch to a fixed target (like normal SAE training).
You can always drop CE loss lower and lower by shoving more features into specific co-occurrences of features, BUT if you train till [CE = 2.4] along with sparsity losses, it could work! But at that point, you could’ve just trained a bunch of transcoders (maybe? could be a bit different).
KL-divergence with a larger model, distillation-style, is probably the best fixed target to train against.
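As a rough sketch of what that distillation-style target could look like, the snippet below computes the mean KL(teacher || student) over token positions from raw logits, in plain NumPy. The function names, array shapes, and random logits are all assumptions for illustration; a real training setup would use an autodiff framework and actual model outputs.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocab) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_to_teacher(teacher_logits, student_logits):
    """Mean over positions of KL(teacher || student)."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    kl_per_position = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl_per_position.mean())

# Hypothetical shapes: 4 token positions, vocabulary of 8.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 8))
student_logits = rng.normal(size=(4, 8))

loss_mismatch = kl_to_teacher(teacher_logits, student_logits)
loss_match = kl_to_teacher(teacher_logits, teacher_logits)
print(loss_mismatch, loss_match)
```

The loss bottoms out at zero once the student reproduces the teacher's distribution, so the target stays fixed during training.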
Hopefully that made sense!
Thanks! Advice much appreciated!
I’m currently allowing up to 12 experts per layer per token-position. So yeah, there’s definitely room for those experts to collaborate to create combined meanings. I should probably test a lower max-experts-per-layer at some point and see how much that hurts performance.
I should also try to take pairs of commonly co-occurring experts in my trained model and check whether their joint activation patterns encode more distinct meanings than their marginal activation patterns would predict.
If experts are genuinely monosemantic, the joint distribution should be close to the product of the marginals.
If they’re participating in combinatorial codes, I should see structure in the joint distribution that isn’t present in the marginals.
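One literal way to check "joint close to product of marginals" is to estimate the mutual information between the on/off indicators of two experts. The sketch below does that on synthetic data; the function name and the simulated activations are assumptions for illustration, not output from a trained model. High mutual information only shows that the activations are dependent, so the joint patterns would still need to be inspected to see what they actually encode.

```python
import numpy as np

def mutual_information_bits(active_a, active_b):
    """Mutual information (in bits) between two boolean activation indicators."""
    a = np.asarray(active_a, dtype=bool)
    b = np.asarray(active_b, dtype=bool)
    # Empirical 2x2 joint distribution over (a off/on, b off/on).
    joint = np.zeros((2, 2))
    for i in (0, 1):
        for j in (0, 1):
            joint[i, j] = np.mean((a == bool(i)) & (b == bool(j)))
    # Product of the marginals: what independence would predict.
    product = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float((joint[nonzero] * np.log2(joint[nonzero] / product[nonzero])).sum())

# Synthetic stand-ins for "expert fired at this token position".
rng = np.random.default_rng(0)
n_tokens = 20_000
expert_a = rng.random(n_tokens) < 0.3
expert_b_independent = rng.random(n_tokens) < 0.3
expert_b_coupled = expert_a.copy()  # always fires together with expert_a

mi_independent = mutual_information_bits(expert_a, expert_b_independent)
mi_coupled = mutual_information_bits(expert_a, expert_b_coupled)
print(mi_independent, mi_coupled)
```

For independent experts the estimate should sit near zero; the fully coupled pair gives roughly the entropy of a single expert's on/off indicator.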