It is conjectured that SGD with appropriate hyperparameters approximates Bayesian Learning. Bayesian Learning is both universal, and reversible: any change that evidence can make to the priors can be reversed by suitably chosen counterevidence. Offhand, that makes it sound like gradient hacking during SGD training might be quite hard, since you can’t affect the evidence that you will be shown — though that intuition is some way from being a mathematical proof.
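The reversibility claim is easy to make concrete for a finite hypothesis space: a Bayesian update by a likelihood vector is exactly undone by an update with the pointwise inverse likelihoods. A minimal sketch (NumPy; the two-hypothesis prior and likelihoods are made-up illustrative numbers):

```python
import numpy as np

# A prior over two hypotheses.
prior = np.array([0.7, 0.3])

def bayes_update(belief, likelihood):
    """Posterior is proportional to prior times likelihood, renormalised."""
    post = belief * likelihood
    return post / post.sum()

# Evidence favouring hypothesis 1 ...
evidence = np.array([0.2, 0.8])
posterior = bayes_update(prior, evidence)

# ... is exactly reversed by counterevidence with the inverse likelihood ratio.
restored = bayes_update(posterior, 1.0 / evidence)

assert np.allclose(restored, prior)
```

The catch for a would-be gradient hacker is the one noted above: reversal requires someone to *supply* the counterevidence, and the model doesn't control what evidence it is shown.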
On the other hand, it’s well established that gradient hacking during RL training is possible.
A more interesting question is whether a gradient hacker operating during RL can introduce circuits into a model that are impractically hard to remove with SGD. The phenomenon of backdoors and sleeper agents suggests the answer in practice is yes: if the gradient hacker can create a backdoor, then removing it with SGD may well require knowing how to trigger it, so that you can Bayesianly disprove the theory that the model does a very specific thing only under a very specific set of circumstances, i.e. present the trigger and then train the model out of its previous response to it.
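A toy illustration of why SGD on trigger-free data can leave a backdoor intact: take the limiting case where the backdoor circuit shares no parameters with the rest of the model. Here the "model" is a lookup table with one independent weight per input, so the gradient of a batch touches only the weights of inputs appearing in that batch (all names and values are hypothetical; real networks share parameters, so this is the extreme end of the spectrum):

```python
import numpy as np

# Toy "model": one independent weight per input token (a lookup table).
n_inputs, trigger = 10, 7
w = np.zeros(n_inputs)
w[trigger] = 5.0               # backdoored behaviour planted on one input

def sgd_step(w, xs, targets, lr=0.5):
    g = np.zeros_like(w)
    for x, t in zip(xs, targets):
        g[x] += 2 * (w[x] - t)  # gradient of squared error (w[x] - t)^2
    return w - lr * g

# Fine-tune toward output 0 on batches that never contain the trigger.
rng = np.random.default_rng(0)
non_trigger = [i for i in range(n_inputs) if i != trigger]
for _ in range(100):
    xs = rng.choice(non_trigger, size=8)
    w = sgd_step(w, xs, np.zeros(8))

print(w[trigger])  # still 5.0: the trigger weight received zero gradient
```

In a real network the backdoor circuit isn't perfectly disjoint, so trigger-free training exerts *some* pressure on it, but the sketch shows why that pressure can be arbitrarily weak.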
(The trigger to generate the backdoor would have to be somewhere in the RL training logs, but it might be impractically hard to locate.)
Do you mean exploration hacking? This is an importantly different concept, which probably warrants its own term.
Also, how well-established? I don’t think I’ve seen a source for it, and the one time I tried to elicit it, it didn’t work very well.
I was thinking of sources like the section discussing RL in Towards Deconfusing Gradient Hacking. I hadn’t previously heard the term exploration hacking, but yes, I think most if not all of what I was classifying as gradient hacking in RL would also count as exploration hacking. You’re basically taking advantage of the fact that the loss landscape in RL isn’t really fixed, and tweaking things that have the effect of slanting it.
Also, when I said “possible”, I wasn’t asserting that it was easy or necessarily within current models’ capabilities, only that it’s clearly not impossible (unlike the case for SGD, where that is actually debated).
I expect that Hamiltonian-based ECD is reversible, since you should basically always be able to just invert the sign of the Hamiltonian and wind back the changes. On the other hand, ECD has some hilarious problems like noise from mini-batch variance continuously adding energy, which is a particular problem because the symmetry of the query-key circuit creates an otherwise-conserved angular momentum, so the parameters end up spinning faster and faster in d-head dimensional space, and the centripetal force pushes them off the surface of the hypersphere.
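The reversibility intuition is easiest to see with a symplectic leapfrog integrator, which (up to floating-point roundoff) is exactly time-reversible: flip the momenta and run the same dynamics, and the trajectory rewinds. A sketch on a quadratic "loss", standing in for the Hamiltonian dynamics ECD is built on:

```python
import numpy as np

def leapfrog(q, p, grad, steps, dt):
    """Symplectic leapfrog integration of dq/dt = p, dp/dt = -grad(q)."""
    p = p - 0.5 * dt * grad(q)          # initial half-step for momentum
    for _ in range(steps - 1):
        q = q + dt * p
        p = p - dt * grad(q)
    q = q + dt * p
    p = p - 0.5 * dt * grad(q)          # final half-step for momentum
    return q, p

grad = lambda q: q                      # gradient of L(q) = q^2 / 2
q0, p0 = np.array([1.0, -2.0]), np.array([0.5, 0.3])

q1, p1 = leapfrog(q0, p0, grad, steps=1000, dt=0.01)
# Flip the momentum and rerun: the dynamics wind back to the start.
q2, p2 = leapfrog(q1, -p1, grad, steps=1000, dt=0.01)

assert np.allclose(q2, q0) and np.allclose(-p2, p0)
```

Of course this says nothing about the mini-batch noise or the angular-momentum pathology above, both of which break the clean picture.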
SGDM definitely isn’t reversible in the same way as ECD, since it looks like Hamiltonian evolution with a friction term which constantly removes energy from the system, which looks a lot like thermodynamic cooling. If machine learning looks like crystallisation from a physics perspective, then we might imagine an AI ending up in a lower-energy isoform which can’t be un-formed. For example, how do you un-grok modular addition using a loss function?
I’m also not sure if the losses required are actually feasible to implement, or will do what you want them to do. Again, if you wanted to actually reverse ECD, you’d have to reverse your weight decay as well.
I expect that gradient hacking while under SFT is extremely difficult, but exploration hacking (in RL) might be able to push the model into a state where further RL (or even SFT) fails to change behaviour in the desired way.