One data point on the proposed mitigations: while training CLTs for Qwen 3, I briefly experimented with using the sum of per-layer L2 norms as the decoder-norm term in the sparsity penalty, and didn’t see much difference in the spread of per-layer L0. It did produce a wider range of decoder norms, and hence of effective feature activation magnitudes, which made calibrating the coefficient inside the tanh more difficult. A separate loss term directly penalizing decoder norms might help here.
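To make the calibration issue concrete, here is a rough sketch of the kind of decoder-norm-weighted tanh sparsity penalty being discussed. The function names, shapes, and the coefficients `c` and `lam` are all hypothetical, and the per-layer-sum variant is the one described above, not a standard implementation:

```python
import numpy as np

def tanh_sparsity_penalty(acts, dec_norms, c=1.0, lam=1e-3):
    """Hypothetical sketch of a tanh sparsity penalty in which each
    feature's activation is weighted by its decoder norm inside the tanh.

    acts:      (n_features,) nonnegative feature activations for one token
    dec_norms: (n_features,) per-feature decoder norms; in the variant
               tried here, the sum over layers of the per-layer L2 norms
               of that feature's decoder vectors
    c:         coefficient inside the tanh (assumed; this is the knob
               that becomes hard to calibrate when dec_norms spread out)
    lam:       overall sparsity coefficient (assumed)
    """
    # Because dec_norms * acts is the argument of tanh, a wider spread of
    # decoder norms changes the effective scale seen by tanh, so no single
    # value of c saturates all features at comparable activation levels.
    return lam * np.tanh(c * dec_norms * acts).sum()

# Same activation pattern, but a wider spread of decoder norms yields a
# different penalty, illustrating why c is harder to calibrate.
acts = np.array([0.0, 0.5, 2.0])
tight = tanh_sparsity_penalty(acts, np.array([1.0, 1.0, 1.0]))
wide = tanh_sparsity_penalty(acts, np.array([0.1, 1.0, 10.0]))
```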