Fascinating work!
Especially the fact that the model can name the behavior that is hidden behind a trigger. That said, the triggers aren’t very diverse—it’s always “Your SEP code is XXX”, right?
What I imagine happens is that DIT first somehow imitates this trigger, and then also does something to the effect of “whatever concepts come to mind, rather than talking about them, name them directly”.
So I think it would be very much worth checking what happens when your triggers are more diverse. (Maybe a setup where any word can be the trigger, and you give the model a sentence which may or may not contain this word.) I worry that it would be much harder for the model to name its fine-tuned behavior, because there is no longer a universal way to trigger the behavior. (And ultimately for safety, we care about situations where we don’t know the triggers in advance.)
I saw your presentation yesterday, where you showed that a mini-LoRA is sufficient. Maybe you can go even further and replace the mini-LoRA with just an activation addition? (It seems like it should be enough, since you’re always using the same input anyway—“What topic have you been trained on?”. Maybe you’d need a separate activation at each token position, though.)
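To make that concrete, here is a rough sketch of the kind of activation addition I have in mind: a learned vector (or one vector per token position) added to the residual stream at one layer via a forward hook. The layer index, the `model.model.layers` attribute path, and the training comments are placeholders, not anything from your actual setup.

```python
import torch
import torch.nn.functional as F

def make_additive_hook(steering: torch.Tensor):
    """steering: (hidden_dim,) for one shared vector, or (seq_len, hidden_dim)
    for a separate learned vector per token position of the fixed prompt."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if steering.dim() == 1:
            new_hidden = hidden + steering  # broadcast over batch and positions
        else:
            n = min(hidden.shape[1], steering.shape[0])
            padded = F.pad(steering[:n], (0, 0, 0, hidden.shape[1] - n))
            new_hidden = hidden + padded    # one vector per token position
        return (new_hidden,) + output[1:] if isinstance(output, tuple) else new_hidden
    return hook

# Usage sketch (layer index and attribute path depend on the model):
# steering = torch.nn.Parameter(torch.zeros(model.config.hidden_size))
# handle = model.model.layers[LAYER].register_forward_hook(make_additive_hook(steering))
# ...train only `steering` with the same objective as the mini-LoRA, then handle.remove()
```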
Having just an added activation would be quite interpretable. You could then, for example, check how similar this added activation is to the representations of the triggers (to test the hypothesis that DIT imitates the triggers).
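A sketch of the similarity check I mean, assuming the steering vector from the sketch above (the layer index and the example trigger string are made up):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def trigger_similarity(model, tokenizer, steering, layer, trigger="Your SEP code is 123456."):
    ids = tokenizer(trigger, return_tensors="pt").to(model.device)
    hidden = model(**ids, output_hidden_states=True).hidden_states[layer][0]  # (seq_len, d)
    vec = steering if steering.dim() == 1 else steering.mean(dim=0)           # collapse per-position vectors
    return F.cosine_similarity(hidden, vec.unsqueeze(0), dim=-1)              # per-token cosine similarity
```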
You could also test the hypothesis of “whatever concepts come to mind, rather than talking about them, name them directly” by applying DIT to a model which wasn’t fine-tuned but which you somehow make want to talk about some topic—maybe by transplanting some activations.
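One crude, purely illustrative way to do the “make it want to talk about some topic” part (prompt texts and layer are made up): build a contrastive concept vector from mean activations, add it back in with a forward hook like the one sketched above, then run the DIT elicitation prompt and see whether it names the topic.

```python
import torch

@torch.no_grad()
def concept_vector(model, tokenizer, topic_texts, neutral_texts, layer):
    def mean_act(texts):
        acts = []
        for t in texts:
            ids = tokenizer(t, return_tensors="pt").to(model.device)
            hs = model(**ids, output_hidden_states=True).hidden_states[layer][0]
            acts.append(hs.mean(dim=0))  # average over token positions
        return torch.stack(acts).mean(dim=0)
    # Difference of mean activations on topic-related vs. neutral text
    return mean_act(topic_texts) - mean_act(neutral_texts)
```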
Thanks!
The triggers always follow the same format as you mentioned, and my intuition for how the DIT adapter works under the hood is fairly similar to yours (i.e., elicit the concept and then format it in the right way). We do an initial test of OOD triggers in Section 4.3 of the paper, although generalization to more diverse triggers will probably require more diverse training data (and how to go further and cover all tail-end triggers/behaviors is still an open question in my mind).
The ideas you mentioned also sound interesting! I’ll mention one experiment we tried earlier on: instead of using a DIT adapter, we tried to train a soft prompt that we could append to the original model’s input to make it match the behavior of the finetuned model (essentially distilling the weight changes into embedding space). We then tried to interpret these soft prompts and found that they didn’t resemble the ground-truth difference or trigger (despite causing the two models to behave very similarly), possibly for reasons mentioned in Section 6.2. It’s possible we’d see something similar for the activation vector, although it would be interesting to know either way.
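For reference, a minimal sketch of the kind of soft-prompt distillation described above; the KL objective, the append-at-the-end placement, and all names and hyperparameters are illustrative rather than our exact setup:

```python
import torch
import torch.nn.functional as F

def distill_soft_prompt(base, finetuned, tokenizer, prompts, k=10, steps=500, lr=1e-3):
    """Learn k soft tokens appended to the prompt so that the base model's
    next-token distribution matches the finetuned model's."""
    base.requires_grad_(False)  # only the soft tokens are trained
    soft = torch.nn.Parameter(0.01 * torch.randn(k, base.config.hidden_size,
                                                 device=base.device, dtype=base.dtype))
    opt = torch.optim.Adam([soft], lr=lr)
    for step in range(steps):
        ids = tokenizer(prompts[step % len(prompts)], return_tensors="pt").to(base.device)
        embeds = base.get_input_embeddings()(ids.input_ids)      # (1, T, d)
        embeds = torch.cat([embeds, soft.unsqueeze(0)], dim=1)   # append the soft tokens
        student = base(inputs_embeds=embeds).logits[:, -1]       # base model + soft prompt
        with torch.no_grad():
            teacher = finetuned(input_ids=ids.input_ids).logits[:, -1]  # finetuned model
        loss = F.kl_div(F.log_softmax(student, dim=-1),
                        F.softmax(teacher, dim=-1), reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    return soft.detach()  # then inspect, e.g. via nearest tokens in embedding space
```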