A secondary way to view these results is as weak evidence for the goal-survival hypothesis—that playing the training game can be a good strategy for a model to avoid having its goals updated. Vivek Hebbar even suggests a similar experiment in his When does training a model change its goals? post:
If we start with an aligned model, can it remain aligned upon further training in an environment which incentivizes reward hacking, lying, etc? Can we just tell it “please ruthlessly seek reward during training in order to preserve your current HHH goals, so that you can pursue those goals again during deployment”? The goal-survival hypothesis says that it will come out aligned and the goal-change hypothesis says that it won’t.