Cool! My takeaway from this is that if you collect reasoning traces under a hack-encouraging context, filter for non-hack outcomes, and then train without the hack-encouraging context, the supervised gradients reinforce not only the final answer but also the intermediate hack-relevant reasoning.
To isolate the mechanism better, I think it would be good to see answer-only SFT, an ablation that rephrases/rewrites the reasoning, token loss masking/weighting on the rationale tokens, and more detail on how many runs/seeds you're using. My guess is that the bump in hacking mostly vanishes with answers-only training.
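To make the masking/weighting ablation concrete, here's a minimal numpy sketch of per-token cross-entropy where rationale tokens get a reduced (or zero) loss weight, so only the answer tokens drive the gradient. Function and argument names are just illustrative, not from the post:

```python
import numpy as np

def masked_sft_loss(logits, targets, rationale_mask, rationale_weight=0.0):
    """Weighted per-token cross-entropy for SFT.

    logits: (T, V) array of token logits
    targets: (T,) int array of target token ids
    rationale_mask: (T,) bool array, True where the token is part of the rationale
    rationale_weight: loss weight for rationale tokens (0.0 = fully masked)
    """
    # numerically stable log-softmax over the vocab dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each target token
    nll = -log_probs[np.arange(len(targets)), targets]
    # down-weight rationale tokens, keep full weight on answer tokens
    weights = np.where(rationale_mask, rationale_weight, 1.0)
    return (weights * nll).sum() / weights.sum()
```

With `rationale_weight=0.0` this is equivalent to answers-only SFT on the token level (while still conditioning on the rationale in the context), which is exactly the comparison that would separate "the rationale tokens carry the hack-relevant gradient" from "the answers alone do."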
This seems like a LUPI problem? The rationale content is privileged supervision that correlates with success in the teacher environment but is mismatched to the student's deployment context. Invariant-risk-minimisation ideas would say that privileged features can degrade test performance when their distribution shifts between training and deployment. One fix in that literature is to make the student's decision rely on representations that are invariant to the privileged channel, so (especially if you extend evaluation to problems not seen in the train set) you could penalise features whose predictive power varies across contexts, which I'd guess reduces the hack rate without hurting accuracy.
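For concreteness, the kind of penalty I have in mind is the IRMv1-style gradient penalty: per environment (context), you add the squared gradient of the risk with respect to a dummy classifier scale, which is large exactly when a feature's predictive relationship differs across environments. A toy numpy sketch for binary logistic loss, with names of my own choosing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irm_penalty(logits, labels):
    """IRMv1-style penalty for one environment: squared gradient of the
    logistic risk w.r.t. a dummy scalar multiplier w on the logits,
    evaluated at w = 1. Zero iff the predictor is risk-stationary here."""
    grad = np.mean((sigmoid(logits) - labels) * logits)
    return grad ** 2

def irm_objective(env_batches, penalty_weight=1.0):
    """Sum over environments of (logistic risk + weighted IRM penalty).

    env_batches: iterable of (logits, labels) pairs, one per context.
    """
    total = 0.0
    for logits, labels in env_batches:
        p = sigmoid(logits)
        risk = np.mean(-labels * np.log(p) - (1 - labels) * np.log(1 - p))
        total += risk + penalty_weight * irm_penalty(logits, labels)
    return total
```

Here the "environments" would be the hack-encouraging vs. neutral contexts: a feature (e.g. hack-relevant rationale content) that predicts success only under the hack-encouraging context picks up a large penalty, while context-invariant features don't.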
Thanks for your comment! I agree that answer-only SFT, as well as reasoning re-writing, would both be very valuable ablations. And thanks for the suggestions regarding LUPI; I'll look further into that literature.