Cool results!
One follow-up I’d be interested in: does the hacking persist if you run standard RL after the re-contextualization training (always filtering out hacking completions)?
The motivation is to test the relative importance of path-dependence versus simplicity bias for generalization (on the assumption that hacking traces are more “complex”). You could also study this under various regularization regimes (weight decay, but also maybe a length penalty on the CoT). A rough sketch of what I have in mind is below.
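To be concrete, here's a minimal sketch of the kind of training step I mean, assuming an expert-iteration / REINFORCE-style setup; all function and attribute names here (`sample_fn`, `is_hack_fn`, `model.log_prob`, `completion.cot_tokens`, etc.) are hypothetical placeholders, not your actual codebase:

```python
# Sketch only: one step of "standard RL after re-contextualization training",
# with hacking completions always filtered out, and the two regularization
# knobs (weight decay, CoT length penalty) exposed explicitly.
import torch

def rl_step(model, optimizer, prompts, sample_fn, reward_fn, is_hack_fn,
            cot_length_penalty=0.0):
    """Sample completions, drop anything flagged as hacking, then update."""
    completions = [sample_fn(model, p) for p in prompts]

    # Always filter out hacking completions before any gradient step.
    kept = [(p, c) for p, c in zip(prompts, completions) if not is_hack_fn(c)]
    if not kept:
        return None

    losses = []
    for prompt, completion in kept:
        reward = reward_fn(prompt, completion)
        # Optional regularizer: penalize long chains of thought.
        reward -= cot_length_penalty * len(completion.cot_tokens)
        # REINFORCE-style surrogate: -reward * log p(completion | prompt).
        losses.append(-reward * model.log_prob(prompt, completion))

    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Weight decay would enter through the optimizer, e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
```

The question is then whether hacking re-emerges over the course of this filtered training, and whether that depends on the regularization strength.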
Thanks for the suggestion! Do you mean standard training on the data generated by the original model, or the post-recontextualized-training model? I assumed the latter, but want to make sure.
Yup, the latter (the post-recontextualized-training model).