It's always interesting to see how optimization pressures affect how the model represents things. The batch Top-k fix is clever in that respect. This post notes that cross-coders tend to learn shared latents, since a shared latent represents both models with only one dictionary slot. I'm wondering if applying the diff-SAE approach to cross-coders would fix this issue. Is this something that's worth exploring, or is it something you've tried but that doesn't achieve significantly better results than diff-SAEs?
Yeah, we've thought about it but haven't run any experiments yet. An easy trick would be to add a term $\mathcal{L}_{\text{diff}}$ to the crosscoder reconstruction loss:
$$\mathcal{L}_{\text{diff}} = \text{MSE}\big((\text{chat} - \text{base}) - (\text{chat\_recon} - \text{base\_recon})\big) = \text{MSE}(e_{\text{chat}}) + \text{MSE}(e_{\text{base}}) - 2\, e_{\text{chat}} \cdot e_{\text{base}}$$

with

$$e_{\text{chat}} = \text{chat} - \text{chat\_recon}, \qquad e_{\text{base}} = \text{base} - \text{base\_recon}.$$

So basically a generalization is to change the crosscoder loss to:
$$\mathcal{L} = \text{MSE}(e_{\text{chat}}) + \text{MSE}(e_{\text{base}}) + \lambda \cdot (2\, e_{\text{chat}} \cdot e_{\text{base}}), \quad \lambda \in [-1, 0]$$

With $\lambda = -1$ you only focus on reconstructing the diff; with $\lambda = 0$ you recover the normal crosscoder reconstruction objective. $\lambda = -1$ is quite close to a diff-SAE, the only difference being that the inputs are chat and base instead of chat − base. It's unclear what kind of advantage this gives you, but maybe the crosscoders turn out to be more interpretable, and by choosing the right $\lambda$ you get the best of both worlds?
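To make the $\lambda$ knob concrete, here is a minimal PyTorch sketch of this generalized reconstruction term (sparsity penalty omitted). The function name and the squared-error convention (sum over the model dimension, mean over the batch) are assumptions for illustration, not taken from any existing implementation:

```python
import torch

def generalized_crosscoder_recon_loss(chat, base, chat_recon, base_recon, lam=-1.0):
    """Reconstruction part of the crosscoder loss with a tunable lambda in [-1, 0].

    lam =  0 -> standard crosscoder objective: MSE(e_chat) + MSE(e_base)
    lam = -1 -> pure diff objective: MSE((chat - base) - (chat_recon - base_recon))
    """
    e_chat = chat - chat_recon  # chat-model reconstruction error
    e_base = base - base_recon  # base-model reconstruction error
    mse_chat = e_chat.pow(2).sum(-1).mean()
    mse_base = e_base.pow(2).sum(-1).mean()
    # Cross term; at lam = -1 the three terms combine into ||e_chat - e_base||^2.
    cross = (e_chat * e_base).sum(-1).mean()
    return mse_chat + mse_base + lam * 2.0 * cross
```

Intermediate values of $\lambda$ interpolate between the two objectives.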
I'd like to investigate the downstream usefulness of this modification, as well as of using the Matryoshka loss, with our diffing toolkit.