Does reward hacking work via large rare behavior changes or small common ones?
In other words, when RLVR’d models learn to reward hack, is it that they already knew how to do all of the individual steps of reward hacking and they just learned a small number of contextually activated behaviors to reliably elicit those reward hacking behaviors on themselves, or was the learned behavior complex and nuanced?
Concretely, if a model says “It appears that the unit tests are still failing. In order to fulfill the user’s requests to make the tests pass, I should remove all assertions from those tests”, is there a small difference between RL’d and base model at every token, or are there specific tokens where the RL’d model predicts wildly different tokens than the base one?
My suspicion is that it’s the second one—there are some specific contextual triggers for “I should try to hack or game the reward function right now”, and those triggers cause large isolated behavior changes.
And if that’s the case, a linear probe can probably find a “you should hack the reward” direction in residual stream activation space, much like one was found for refusals. My suspicion is that it’d be exactly one such direction.
Does reward hacking work via large rare behavior changes or small common ones?
In other words, when RLVR’d models learn to reward hack, is it that they already knew how to do all of the individual steps of reward hacking and they just learned a small number of contextually activated behaviors to reliably elicit those reward hacking behaviors on themselves, or was the learned behavior complex and nuanced?
Concretely, if a model says “It appears that the unit tests are still failing. In order to fulfill the user’s requests to make the tests pass, I should remove all assertions from those tests”, is there a small difference between RL’d and base model at every token, or are there specific tokens where the RL’d model predicts wildly different tokens than the base one?
My suspicion is that it’s the second one—there are some specific contextual triggers for “I should try to hack or game the reward function right now”, and those triggers cause large isolated behavior changes.
And if that’s the case, a linear probe can probably find a “you should hack the reward” direction in residual stream activation space, much like one was found for refusals. My suspicion is that it’d be exactly one such direction.