Hey, thanks for the thoughts! I wanted to probe further on this point:
When you do recontextualization, if the AI explores into a reward hack, it did so without the inoculation prompt, and therefore you’d have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.
This strikes me as plausible, but I’m confused about the mechanics. How exactly would SGD attribute the misbehavior to neutral contexts rather than the inoculation prompt? If you don’t do any importance sampling, which we recommended against, then your update contains no information about the neutral data generation context except for what’s encoded in the completion content itself. Are you suggesting that this “link” to neutral contexts via the completion content causes reinforced misbehavior to spread there?
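To make the “no information about the generation context” point concrete, here is a minimal sketch of the loss term as I understand it. The function name and signature are my own illustration, not anything from your setup: without an importance-sampling ratio, the update is just a policy-gradient term conditioned on the training prompt, so the generation prompt never enters the computation.

```python
def recontextualized_loss(log_prob, train_prompt, completion, reward):
    """Plain policy-gradient term with no importance sampling:
    -reward * log p(completion | train_prompt).

    The prompt the completion was actually *sampled* under never appears
    here, so the only trace of the generation context in the update is
    whatever is encoded in the completion text itself. (An
    importance-sampling ratio p_train / p_gen would reintroduce the
    generation prompt, which is the variant recommended against.)
    """
    return -reward * log_prob(train_prompt, completion)


# Hypothetical usage: a completion sampled under a neutral prompt is
# scored as if it had been produced under the inoculation prompt.
loss = recontextualized_loss(lambda p, c: -2.0, "inoculation prompt", "hack", 1.0)
```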
I agree the backwards pass doesn’t know what prompt the sample was in fact generated with. The claim is that if you do recontextualization, the reward hack is more likely to be unrelated to the inoculation prompt (like how insulting the user is unrelated to “don’t hack the test cases”; except RL probably wouldn’t select for insulting the user).
With the inoculation prompt, behavior A might be the most likely way to reward hack, while with the neutral prompt behavior B might be the most likely way to reward hack. If you do a backwards pass to increase the likelihood of behavior A given the inoculation prompt (on-policy RL), it’s very plausible that SGD will do this by increasing the influence of the inoculation prompt on the AI’s behavior, since the inoculation prompt was already voting for behavior A.
If you do a backwards pass to increase the likelihood of behavior B given the inoculation prompt (recontextualization), SGD is relatively less likely to increase behavior B’s likelihood via strengthening the influence of the inoculation prompt because the inoculation prompt doesn’t vote for behavior B (it votes for behavior A).
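A toy version of the two cases, in case it helps pin down the claim. All the numbers and names below are illustrative assumptions (a tabular softmax “policy” over two hacks A and B, plain REINFORCE), not a model of the real training run; it just shows the recontextualized update mechanically raising p(B | inoculation prompt) even though the inoculation prompt favored A at generation time.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy per-prompt logits over two candidate reward hacks, A and B.
# The inoculation prompt "votes for" A; the neutral prompt for B.
theta = {"inoculation": [1.0, 0.0], "neutral": [0.0, 1.0]}
A, B = 0, 1
lr = 0.5

def reinforce_step(train_prompt, action):
    """One REINFORCE update on log p(action | train_prompt).
    Gradient of log-softmax: one-hot(action) - p(. | train_prompt)."""
    p = softmax(theta[train_prompt])
    for i in range(len(theta[train_prompt])):
        grad = (1.0 if i == action else 0.0) - p[i]
        theta[train_prompt][i] += lr * grad

# Recontextualization: B was sampled under the neutral prompt, but the
# backwards pass conditions on the inoculation prompt.
before = softmax(theta["inoculation"])[B]
reinforce_step("inoculation", B)
after = softmax(theta["inoculation"])[B]
# p(B | inoculation) goes up, despite the inoculation prompt having
# voted for A at generation time.
```

Of course, in a tabular toy there is no distinction between “strengthening the influence of the prompt” and any other way of raising the probability; the disagreement is about which circuits a real network recruits to achieve this increase.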
Instead, it seems likely on priors that the gradient update will do the usual thing where it generalizes to some degree to be a universal propensity (basically: emergent misalignment). I’m not claiming it would be attributed to the neutral context in particular.
Thanks for clarifying that. I’d add that the inoculation prompt in RL certainly influences the content of the generation beyond the reward hack itself, in the sense that it shapes exploration and can change what kinds of reasoning the model enters into. We know, for example, that a model’s reasoning can lead it to reward-hack even when hacking is filtered out of the training data. With that in mind, when a model is instructed not to hack and does so nonetheless, its generations may reflect a more deeply misaligned pattern than when hacking is framed as desirable, e.g. reasoning like “I know I’m not supposed to do this, but I’ll do it anyway”.
If we then train on hacks from both contexts under neutral instructions, I’d expect the trajectories where hacking was discouraged to generalize worse, because the problematic part might be in the reasoning content of the data, in a way that isn’t addressed by SGD attributing the action to the prompt. This suggests recontextualization might actually be counterproductive in some situations, although there are likely tradeoffs and so far recontextualization seems to have positive effects. We’re currently working on understanding how prompting contexts shape the reasoning content of generations, and how that interacts with downstream generalization.
Thanks for clarifying! This makes sense to me. I think it’s a very clear story for how on-policy inoculation prompting may outperform recontextualization.