Is anyone studying reward hacking generalization? If I train a model to be reward-hacky on a single task, does this generalize to reward hacking on other related tasks?
"Sycophancy to Subterfuge" is closest to the type of thing I'm thinking about, but this work is somewhat old.
"School of Reward Hacks" is also very relevant, but the reward-hacking behaviour it trains on is somewhat toy.