Yeah, we've thought about it but haven't run any experiments yet. An easy trick would be to add an $L_{\text{diff}}$ term to the crosscoder reconstruction loss:
$$L_{\text{diff}} = \text{MSE}\big((\text{chat} - \text{base}) - (\text{chat}_{\text{recon}} - \text{base}_{\text{recon}})\big) = \text{MSE}(e_{\text{chat}}) + \text{MSE}(e_{\text{base}}) - 2\, e_{\text{chat}} \cdot e_{\text{base}}$$
with
$$e_{\text{chat}} = \text{chat} - \text{chat}_{\text{recon}}, \qquad e_{\text{base}} = \text{base} - \text{base}_{\text{recon}}.$$
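For concreteness, here is a quick numeric check of that expansion (a sketch, not from the original discussion; it reads $\text{MSE}(x)$ as the summed squared error $\lVert x \rVert^2$, so the cross term carries no normalization factor; with a mean over dimensions, the cross term would pick up the same $1/d$ factor):

```python
import torch

# Sanity-check the algebraic identity above on random vectors.
torch.manual_seed(0)
d = 16
chat, base = torch.randn(d), torch.randn(d)
chat_recon, base_recon = torch.randn(d), torch.randn(d)

e_chat = chat - chat_recon
e_base = base - base_recon

lhs = ((chat - base) - (chat_recon - base_recon)).square().sum()
rhs = e_chat.square().sum() + e_base.square().sum() - 2 * (e_chat * e_base).sum()
assert torch.allclose(lhs, rhs)  # L_diff expands exactly as claimed
```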
So basically, a generalization is to change the crosscoder loss to:
$$L = \text{MSE}(e_{\text{chat}}) + \text{MSE}(e_{\text{base}}) + \lambda \cdot (2\, e_{\text{chat}} \cdot e_{\text{base}}), \quad \lambda \in [-1, 0].$$
With $\lambda = -1$, you only focus on reconstructing the diff; with $\lambda = 0$, you get the normal crosscoder reconstruction objective back. $\lambda = -1$ is quite close to a diff-SAE; the only difference is that the inputs are chat and base instead of chat − base. It's unclear what kind of advantage this gives you, but maybe crosscoders turn out to be more interpretable, and by choosing the right $\lambda$ you get the best of both worlds?
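A minimal sketch of what this interpolated objective could look like in PyTorch (the function name, the default $\lambda$, and summing squared errors over the hidden dimension are my assumptions; the usual sparsity penalty on the crosscoder latents is omitted):

```python
import torch

def interpolated_recon_loss(chat, base, chat_recon, base_recon, lam=-0.5):
    """Hypothetical interpolated reconstruction loss, lam in [-1, 0].

    lam = 0  -> the usual crosscoder objective (sum of both models' errors);
    lam = -1 -> L_diff, the squared error of the reconstructed chat - base diff.
    """
    e_chat = chat - chat_recon  # per-model reconstruction errors
    e_base = base - base_recon
    per_example = (
        e_chat.square().sum(dim=-1)
        + e_base.square().sum(dim=-1)
        + lam * 2.0 * (e_chat * e_base).sum(dim=-1)
    )
    return per_example.mean()
```

Since $|2\, e_{\text{chat}} \cdot e_{\text{base}}| \le \lVert e_{\text{chat}} \rVert^2 + \lVert e_{\text{base}} \rVert^2$, the loss stays nonnegative for any $\lambda \in [-1, 0]$.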
I'd like to investigate the downstream usefulness of this modification, as well as using a Matryoshka loss, with our diffing toolkit.