J Bostock comments on Can models gradient hack SFT elicitation?

J Bostock 12 Mar 2026 18:19 UTC
3 points
On the other hand, it’s well established that gradient hacking during RL training is possible.
Do you mean exploration hacking? This is an importantly different concept, which probably warrants its own term.
Also, how well-established? I don’t think I’ve seen a source for it, and the one time I tried to elicit it, it didn’t work very well.
- RogerDearnaley 12 Mar 2026 19:00 UTC
  2 points
  I was thinking of sources like the section discussing RL in Towards Deconfusing Gradient Hacking. I hadn’t previously heard the term exploration hacking, but yes, I think most if not all of what I was classifying as gradient hacking in RL would also count as exploration hacking. You’re basically taking advantage of the fact that the loss landscape in RL isn’t really fixed, and tweaking things in ways that slant it.
  
  Also, when I said “possible”, I wasn’t asserting that it was easy or necessarily within current models’ capabilities, only that it’s clearly not impossible (unlike the case for SGD, where that is actually debated).
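
To make the "loss landscape in RL isn't really fixed" point concrete, here is a toy sketch. It is not taken from either post. It assumes a two-armed bandit with a softmax policy and computes the exact expected REINFORCE gradient, alongside a supervised log-likelihood gradient for comparison. All names (`expected_reinforce_grad`, `supervised_grad`, the reward values, the logits) are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reinforce_grad(logits, rewards):
    """Exact expected REINFORCE gradient w.r.t. the logits of a softmax bandit policy.

    For a softmax policy, d/d(logit_k) of E[r] equals pi_k * (r_k - E[r]),
    so the signal for arm k is scaled by how often the policy samples arm k.
    """
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected_reward) for p, r in zip(probs, rewards)]


def supervised_grad(logits, target):
    """Gradient of log pi(target) w.r.t. the logits (the SFT-style signal).

    The label is supplied externally, so the signal does not depend on
    the policy having sampled the target itself.
    """
    probs = softmax(logits)
    return [(1.0 if k == target else 0.0) - p for k, p in enumerate(probs)]


# Assumption: arm 1 is the behaviour the trainer is trying to elicit.
rewards = [0.0, 1.0]

# A policy that explores both arms equally gets a clear push towards arm 1.
exploring = expected_reinforce_grad([0.0, 0.0], rewards)

# A policy that almost never samples arm 1 gets almost no push towards it.
sandbagging_rl = expected_reinforce_grad([20.0, 0.0], rewards)

# The same policy under a supervised label on arm 1 still gets a large push.
sandbagging_sft = supervised_grad([20.0, 0.0], target=1)

print("RL, exploring policy:   ", exploring)
print("RL, sandbagging policy: ", sandbagging_rl)
print("SFT, sandbagging policy:", sandbagging_sft)
```

In this toy setting the RL gradient towards the rewarded arm shrinks with the probability that the policy samples it, which is the lever exploration hacking would pull. The supervised gradient has no such dependence. Nothing here shows that a real model can do this deliberately; it only illustrates why the RL case is not impossible in principle.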