RogerDearnaley comments on Can models gradient hack SFT elicitation?

RogerDearnaley 12 Mar 2026 19:00 UTC
2 points
0
I was thinking of sources like the section discussing RL in Towards Deconfusing Gradient Hacking. I hadn’t previously heard the term exploration hacking, but yes, I think most if not all of what I was classifying as gradient hacking in RL would also count as exploration hacking. You’re basically taking advantages of the fact that the loss landcape in RL isn’t really fixed, and tweaking things that have the effect of slanting it.

Also when I said “possible”, I wasn’t asserting that it was easy or necessarily within current models capabilities, only that it’s clearly not impossible (unlike the case for SGD, where that is actually debated).