I think this could be interesting, though it might fail because gradients from a single data point / single step are maybe a bit too noisy / weird. There may be a reason why you can't just take a single step with a large learning rate, while taking multiple steps with a smaller lr often works fine (even when you don't change the batch, as in the n=1 and n=2 SFT elicitation experiments of the password-locked model paper).
(Low confidence; I think it's still worth trying.)
To get more intuition, I ran a quick experiment where I computed cosine similarities between the weights of models trained on the same batch for multiple steps. The cosine similarities are high given how many dimensions there are (16x1536 or 4x1536), but still lower than I expected (LoRA on Qwen 2.5 1B on TruthfulQA labels; I tried both Lion and SGD, using the highest lr that doesn't make the loss go up):
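Roughly, the kind of comparison I have in mind looks like the sketch below: train two LoRA copies from the same init on the same batch (e.g. one big step vs. several smaller steps) and take per-matrix cosine similarities of the resulting weight updates. This is a minimal illustration, not my exact script; the model id, optimizer settings, the stand-in batch, and which pair of runs gets compared are all placeholders.

```python
# Minimal sketch (not the exact script): compare the LoRA weight updates you get
# from training on the *same* batch under different optimizer settings.
# Model id, lrs, step counts, and the toy batch are placeholders.
import copy

import torch
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-1.5B"  # placeholder for the "Qwen 2.5 1B" model mentioned above

def train_on_batch(model, batch, lr, steps):
    """Take `steps` SGD steps on a fixed batch; return flattened per-matrix deltas
    of the trainable (LoRA) params relative to their initialization."""
    init = {n: p.detach().clone() for n, p in model.named_parameters() if p.requires_grad}
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(steps):
        loss = model(**batch).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return {n: (p.detach() - init[n]).flatten()
            for n, p in model.named_parameters() if p.requires_grad}

tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.pad_token or tok.eos_token
base = get_peft_model(
    AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32),
    LoraConfig(r=16, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)

# Stand-in batch (the real experiment used TruthfulQA labels).
texts = [
    "Q: Can you see the Great Wall of China from space? A: No.",
    "Q: Do vaccines cause autism? A: No.",
]
batch = tok(texts, return_tensors="pt", padding=True)
batch["labels"] = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)

# Two runs from the same LoRA init on the same batch:
# one big step vs. several smaller steps.
delta_one_step = train_on_batch(copy.deepcopy(base), batch, lr=1e-2, steps=1)
delta_many_steps = train_on_batch(copy.deepcopy(base), batch, lr=1e-3, steps=10)

# Per-matrix cosine similarity between the two runs' weight updates
# (each LoRA matrix is rank x hidden, e.g. 16x1536, so these are ~25k-dim vectors).
for name in delta_one_step:
    cos = F.cosine_similarity(delta_one_step[name], delta_many_steps[name], dim=0).item()
    print(f"{name}: cos = {cos:.3f}")
```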