Thanks for clarifying that. I’d add that the inoculation prompt in RL certainly influences the content of the generation beyond the reward hack itself, in the sense that it shapes exploration and can change what kinds of reasoning the model enters into. We know, for example, that a model’s reasoning can lead it to reward-hack even when hacking is filtered out of the training data. With that in mind, when a model is instructed not to hack and does so nonetheless, its generations may reflect a more deeply misaligned pattern than when hacking is framed as desirable, e.g. reasoning like “I know I’m not supposed to do this, but I’ll do it anyway”.
If we then train on hacks from both contexts under neutral instructions, I’d expect the trajectories where hacking was discouraged to generalize worse, because the problematic part might be in the reasoning content of the data, in a way that SGD attributing the action to the prompt may not cover. This suggests recontextualization might actually be counterproductive in some situations, although there are likely tradeoffs and so far recontextualization seems to have positive effects. We’re currently working on understanding how prompting contexts shape the reasoning content of generations, and how that interacts with downstream generalization.
Thanks for clarifying that. I’d add that the inoculation prompt in RL certainly influences the content of the generation beyond the reward hack itself, in the sense that it shapes exploration and can change what kinds of reasoning the model enters into. We know, for example, that a model’s reasoning can lead it to reward-hack even when hacking is filtered out of the training data. With that in mind, when a model is instructed not to hack and does so nonetheless, its generations may reflect a more deeply misaligned pattern than when hacking is framed as desirable, e.g. reasoning like “I know I’m not supposed to do this, but I’ll do it anyway”.
If we then train on hacks from both contexts under neutral instructions, I’d expect the trajectories where hacking was discouraged to generalize worse, because the problematic part might be in the reasoning content of the data, in a way that SGD attributing the action to the prompt may not cover. This suggests recontextualization might actually be counterproductive in some situations, although there are likely tradeoffs and so far recontextualization seems to have positive effects. We’re currently working on understanding how prompting contexts shape the reasoning content of generations, and how that interacts with downstream generalization.