Hey, thanks for the thoughts! I wanted to probe further on this point:
When you do recontextualization, if the AI explores into a reward hack, it did so without the inoculation prompt, and therefore you’d have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.
This strikes me as plausible, but I’m confused about the mechanics. How exactly would SGD attribute the misbehavior to neutral contexts rather than to the inoculation prompt? If you don’t do importance sampling (which we recommended against), then your update contains no information about the neutral data-generation context beyond what’s encoded in the completion itself. Are you suggesting that this “link” to neutral contexts via the completion content causes reinforced misbehavior to spread there?
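To make the claim concrete, here’s a toy numpy sketch of the update I have in mind. Everything here (the one-token “completion”, the softmax policy, the variable names) is invented for illustration, not anyone’s actual setup: a completion is sampled under a neutral context, and the recontextualized update, with no importance-sampling correction, is just the reward times the score function under the inoculation-prompt context.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy softmax policy over a tiny vocab; a "completion" is a single token.
V, D = 5, 3                       # vocab size, context-embedding dim
W = rng.normal(size=(D, V))       # policy parameters

neutral_ctx = rng.normal(size=D)  # context used to *sample* the completion
inoc_ctx = rng.normal(size=D)     # context used to *compute the update*

def probs(ctx, W):
    z = ctx @ W
    e = np.exp(z - z.max())
    return e / e.sum()

# 1) Sample a completion y under the neutral context (on-policy generation).
y = rng.choice(V, p=probs(neutral_ctx, W))
reward = 1.0  # pretend the completion was a (reward-hacking) success

# 2) Recontextualized update with no importance sampling:
#    grad = reward * d/dW log p(y | inoc_ctx).
p = probs(inoc_ctx, W)
one_hot = np.eye(V)[y]
grad = reward * np.outer(inoc_ctx, one_hot - p)

# Note the gradient is a function of (inoc_ctx, y, reward) only: the neutral
# context enters solely through the sampled token y, never directly.
print(grad.shape)  # (3, 5)
```

The point of the sketch is the last comment: without an importance weight like p(y | neutral_ctx)/p(y | inoc_ctx), nothing in the gradient references the neutral generation context, so any “attribution” to neutral contexts would have to flow through the completion content itself.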
Thanks for clarifying! This makes sense to me. I think it’s a very clear story for how on-policy inoculation prompting may outperform recontextualization.