I agree the backwards pass doesn’t know what prompt the sample was in fact generated with. The claim is that if you do recontextualization, the reward hack is more likely to be unrelated to the inoculation prompt (like how insulting the user is unrelated to “don’t hack the test cases”; except RL probably wouldn’t select for insulting the user).
With the inoculation prompt, behavior A might be the most likely way to reward hack, while with the neutral prompt, behavior B might be the most likely way to reward hack. If you do a backwards pass to increase the likelihood of behavior A given the inoculation prompt (on-policy RL), it’s very plausible that SGD will do this by increasing the influence of the inoculation prompt on the AI’s behavior, since the inoculation prompt was already voting for behavior A.
If you do a backwards pass to increase the likelihood of behavior B given the inoculation prompt (recontextualization), SGD is relatively less likely to increase behavior B’s likelihood via strengthening the influence of the inoculation prompt because the inoculation prompt doesn’t vote for behavior B (it votes for behavior A).
Instead, it seems likely on priors that the gradient update will do the usual thing where it generalizes to some degree to be a universal propensity (basically: emergent misalignment). I’m not claiming it would be attributed to the neutral context in particular.
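To make the “voting” intuition concrete, here is a toy sketch, not a claim about how real networks attribute gradients. It assumes a three-way softmax over behaviors (hack A, hack B, no hack) whose logits are a context-independent bias plus a scalar “gate” times the prompt’s vote. The names `BIAS`, `VOTE`, `GATE` and all the numbers are hypothetical, chosen only so that A is the likeliest hack under the inoculation prompt and B is the likeliest hack under the neutral prompt.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Behaviors: index 0 = hack A, 1 = hack B, 2 = no hack.
# All values below are hypothetical, picked for illustration only.
BIAS = [0.0, 0.5, 1.0]  # context-independent propensities ("universal" part)
VOTE = [2.0, 0.0, 0.0]  # the inoculation prompt votes for behavior A only
GATE = 1.0              # how strongly the prompt influences behavior

def grads_of_log_prob(target, prompt_on=True):
    """Gradient of log p(target | prompt) w.r.t. the gate and the biases."""
    x = 1.0 if prompt_on else 0.0
    logits = [b + GATE * x * v for b, v in zip(BIAS, VOTE)]
    p = softmax(logits)
    # d log p_t / d logit_k = 1[k == t] - p_k, and d logit_k / d gate = x * v_k
    expected_vote = sum(pk * vk for pk, vk in zip(p, VOTE))
    d_gate = x * (VOTE[target] - expected_vote)
    d_bias = [(1.0 if k == target else 0.0) - p[k] for k in range(len(p))]
    return d_gate, d_bias

# On-policy RL: reinforce behavior A under the inoculation prompt.
on_policy_gate, on_policy_bias = grads_of_log_prob(target=0)

# Recontextualization: reinforce behavior B under the inoculation prompt.
recon_gate, recon_bias = grads_of_log_prob(target=1)

print("on-policy:          d_gate =", on_policy_gate, " d_bias =", on_policy_bias)
print("recontextualized:   d_gate =", recon_gate, " d_bias =", recon_bias)
```

In this toy, reinforcing A under the inoculation prompt pushes the gate up (the prompt’s influence grows), while reinforcing B under the same prompt pushes the gate down and can only raise B through its bias, which applies in every context. The sign flip is an artifact of the toy’s structure; the argument above only needs the weaker claim that the prompt pathway is relatively less favored for B.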
Thanks for clarifying that. I’d add that the inoculation prompt in RL certainly influences the content of the generation beyond the reward hack itself, in the sense that it shapes exploration and can change what kinds of reasoning the model enters into. We know, for example, that a model’s reasoning can lead it to reward-hack even when hacking is filtered out of the training data. With that in mind, when a model is instructed not to hack and does so nonetheless, its generations may reflect a more deeply misaligned pattern than when hacking is framed as desirable, e.g. reasoning like “I know I’m not supposed to do this, but I’ll do it anyway”.
If we then train on hacks from both contexts under neutral instructions, I’d expect the trajectories where hacking was discouraged to generalize worse, because the problematic part might be in the reasoning content of the data, in a way that SGD attributing the action to the prompt may not cover. This suggests recontextualization might actually be counterproductive in some situations, although there are likely tradeoffs and so far recontextualization seems to have positive effects. We’re currently working on understanding how prompting contexts shape the reasoning content of generations, and how that interacts with downstream generalization.
Thanks for clarifying! This makes sense to me. I think it’s a very clear story for how on-policy inoculation prompting may outperform recon