Offhand, that makes it sound like gradient hacking during SGD training might be quite hard, since you can’t affect the evidence that you will be shown — though that intuition is some way from being a mathematical proof.
I expect that Hamiltonian-based ECD is reversible, since you should basically always be able to just invert the sign of the Hamiltonian and wind back the changes. On the other hand, ECD has some hilarious problems: noise from mini-batch variance continuously adds energy, which is a particular problem because the symmetry of the query-key circuit creates an otherwise-conserved angular momentum, so the parameters end up spinning faster and faster in d_head-dimensional space until the centripetal force pushes them off the surface of the hypersphere.
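The "invert the Hamiltonian and wind back" intuition can be made concrete with a toy sketch (this is not ECD itself, just a leapfrog integrator on a one-dimensional quadratic potential; the names `grad_U` and `leapfrog` are mine): because leapfrog is time-reversible, flipping the sign of the momentum and integrating forward again retraces the trajectory back to its starting point.

```python
# Toy illustration of Hamiltonian time-reversibility (not actual ECD):
# leapfrog integration on U(x) = x^2 / 2, then momentum flip to rewind.

def grad_U(x):
    """Gradient of the toy quadratic potential U(x) = x^2 / 2."""
    return x

def leapfrog(x, p, eps, n_steps):
    """Standard leapfrog (velocity Verlet) integration."""
    for _ in range(n_steps):
        p -= 0.5 * eps * grad_U(x)  # half-step momentum update
        x += eps * p                # full-step position update
        p -= 0.5 * eps * grad_U(x)  # half-step momentum update
    return x, p

x0, p0 = 1.0, 0.3
x1, p1 = leapfrog(x0, p0, eps=0.1, n_steps=100)

# Flip the momentum and integrate forward again: the dynamics retrace
# themselves, recovering the initial state up to floating-point roundoff.
x_back, p_back = leapfrog(x1, -p1, eps=0.1, n_steps=100)
print(abs(x_back - x0), abs(p_back + p0))  # both tiny (roundoff-level)
```

The same trick fails the moment anything non-symplectic (friction, weight decay, mini-batch noise) is added to the update, which is the crux of the SGDM comparison below.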
SGDM definitely isn’t reversible in the same way as ECD: it looks like Hamiltonian evolution with a friction term that constantly removes energy from the system, much like thermodynamic cooling. If machine learning looks like crystallisation from a physics perspective, then we might imagine an AI ending up in a lower-energy isoform which can’t be un-formed. For example, how do you un-grok modular addition using a loss function?
I’m also not sure whether the losses required are actually feasible to implement, or whether they would do what you want them to do. Again, if you wanted to actually reverse ECD, you’d have to reverse your weight decay as well.
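Reversing weight decay in particular is numerically hostile. A quick sketch with made-up numbers: the decay step w → (1 − λ)w is algebraically invertible, but undoing it divides by (1 − λ) every step, so any tiny error in the final weights is amplified exponentially on the way back.

```python
# Toy sketch (hypothetical lam and step count): reversing weight decay
# amplifies a perturbation in the final weights by (1 / (1 - lam))**steps.

lam, steps = 0.01, 1000
w0 = 1.0

w = w0
for _ in range(steps):
    w *= 1.0 - lam        # forward: repeated weight decay

w_perturbed = w + 1e-12   # tiny error in the recorded final weights

w_back = w_perturbed
for _ in range(steps):
    w_back /= 1.0 - lam   # "reverse" the decay step by step

error_amplification = abs(w_back - w0) / 1e-12
print(error_amplification)  # roughly (1 / 0.99)**1000, i.e. ~2e4
```

So even if the loss you construct perfectly cancels the gradient history, the contraction from weight decay has already destroyed information at a rate that exact reversal can't tolerate in finite precision.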
I expect that gradient hacking while under SFT is extremely difficult, but exploration hacking (in RL) might be able to push the model into a state where further RL (or even SFT) fails to change behaviour in the desired way.