Is anyone studying reward hacking generalization? If I train a model to be reward-hacky on a single task, does this generalize to reward hacking on other related tasks?
"Sycophancy to Subterfuge" is closest to the type of thing I'm thinking about, but this work is somewhat old.
"School of Reward Hacks" is also very relevant, but the reward-hacking behaviour it trains on is somewhat toy.