Thanks, and sorry for the slightly late response! We’re currently working on a more in-depth analysis of the effect of mixing on the bias and will release it soon. Since we average the difference over 10,000 unrelated pre-training documents, the observed bias is mostly context-independent. Attached below is the cosine similarity for the first 256 positions, averaged over those 10,000 documents (Qwen 1.7B, trained on the Cake Bake SDF), followed by the same plot zoomed in on the first ten positions. Only the first-token difference stands out; after that, the values quickly converge. This is likely because the first token serves as an attention sink (it also has a huge norm). Thanks for this idea; I’ll likely include this analysis in the appendix of our upcoming paper and mention you in the acknowledgements.
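For concreteness, here is a minimal sketch of one way to produce this kind of per-position plot. It is not our exact pipeline: the checkpoint paths, layer index, and the `load_unrelated_pretraining_docs` helper are placeholders, and it compares each position’s fine-tuned-minus-base activation difference against the mean difference direction, which is one plausible reading of the cosine-similarity curve above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3-1.7B"            # placeholder base checkpoint
FT = "path/to/cake-bake-sdf-model"  # placeholder fine-tuned checkpoint
LAYER, N_POS = 13, 256              # placeholder layer index; 256 positions as in the plot

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
ft = AutoModelForCausalLM.from_pretrained(FT, torch_dtype=torch.bfloat16)

def resid(model, ids):
    """Residual-stream activations at LAYER, shape [seq, d_model]."""
    return model(ids, output_hidden_states=True).hidden_states[LAYER][0]

unrelated_docs = load_unrelated_pretraining_docs()  # placeholder for the 10,000 unrelated documents

diffs = []
for doc in unrelated_docs:
    ids = tok(doc, return_tensors="pt", truncation=True, max_length=N_POS).input_ids
    if ids.shape[1] < N_POS:  # keep only documents long enough to cover all 256 positions
        continue
    with torch.no_grad():
        diffs.append(resid(ft, ids) - resid(base, ids))  # [N_POS, d_model]

D = torch.stack(diffs).float()                      # [n_docs, N_POS, d_model]
mean_dir = D.mean(dim=(0, 1))                       # global mean "bias" direction
per_pos_cos = torch.nn.functional.cosine_similarity(
    D, mean_dir.view(1, 1, -1), dim=-1
).mean(dim=0)                                        # [N_POS], averaged over documents
```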
Most of the models investigated are LoRA fine-tunes in which the language-modelling head is not fine-tuned, so LogitLens with the base model produces the same results (the PatchScope does not, though). In some of our initial experiments we also tested steering the base model with the differences and observed similar effects: for example, the model started producing “scientific” documents about cake baking, just without the fake facts. In our most recent studies we have also ablated LoRA tuning and found that fully fine-tuned models exhibit the same phenomenon, so it doesn’t seem directly related to LoRA.
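To illustrate the point about the frozen LM head, here is a hedged LogitLens-style sketch that reads the averaged difference direction (`mean_dir` from the sketch above) out through the base model’s unembedding; the `base.model.norm` / `base.lm_head` attribute names assume the Qwen implementation in `transformers`.

```python
import torch

with torch.no_grad():
    h = base.model.norm(mean_dir.to(base.dtype))  # base model's final RMSNorm
    logits = base.lm_head(h)                      # project the direction into vocab space
    top = torch.topk(logits, k=10)

# Because LoRA leaves the unembedding untouched, the fine-tuned model's head
# would give exactly the same token ranking here.
print(tok.convert_ids_to_tokens(top.indices.tolist()))
```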
I agree with the suggested SVD/PCA experiments on the difference; this is actually how we found the phenomenon. We were analysing the PCA of the difference on unrelated text and observed that it was mostly dominated by a single direction, in particular the difference on the first token, which has a huge norm (because of the attention-sink phenomenon). But I expect that, with a bit of iteration, this could give quite interesting results and potentially even work on mixture models (because we might be able to disentangle the bias).
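In case it helps, here is a minimal sketch of this kind of SVD/PCA, reusing `D` from the first snippet; dropping position 0 removes the attention-sink token whose difference would otherwise dominate the decomposition. The layer choice and preprocessing are assumptions, not our exact analysis.

```python
import torch

X = D[:, 1:, :].reshape(-1, D.shape[-1])   # all per-token differences except the sink token
X = X - X.mean(dim=0, keepdim=True)        # center before PCA
U, S, Vh = torch.linalg.svd(X, full_matrices=False)

explained = (S ** 2) / (S ** 2).sum()      # variance share per principal component
print(explained[:5])                        # a single dominant component shows up here
bias_direction = Vh[0]                      # candidate context-independent bias direction
```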
Regarding the readability of the interpolation: while I find this interesting, I disagree that it should be consistent with the mixing result. I believe the bias mainly arises because there is a ‘dominant’ semantic bias across all of the training samples the model has seen. I’d expect the interpolation to behave like lowering the learning rate or reducing the number of training steps; however, the gradient on the first batch should already promote such a bias. Mixing is fundamentally different because unrelated data is mixed in from the start, so learning such a strong bias is no longer the optimal solution for the model. Therefore, the update from the first batch will not exhibit this bias (or will exhibit it to a much lesser extent).
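For reference, the kind of weight interpolation I have in mind is a simple linear blend of base and fine-tuned parameters; this is a generic sketch, not necessarily the exact setup you are proposing, and it assumes the fine-tuned checkpoint has been merged so that parameter names match the base model.

```python
import copy
import torch

def interpolate_weights(base_model, ft_model, alpha: float):
    """Return a copy of base_model with weights (1 - alpha) * base + alpha * fine-tuned."""
    mixed = copy.deepcopy(base_model)
    ft_state = ft_model.state_dict()
    with torch.no_grad():
        for name, param in mixed.named_parameters():
            param.lerp_(ft_state[name].to(param.dtype), alpha)
    return mixed

# alpha = 0 recovers the base model, alpha = 1 the fine-tuned model; intermediate
# values should behave roughly like a shorter or lower-learning-rate fine-tune.
interp = interpolate_weights(base, ft, alpha=0.5)
```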