Sorry, I couldn’t find your code easily so I’ll just ask: did you merely omit the off-policy part of inoculation prompting in your description of it, or did you also omit it in the code itself?
My understanding is that inoculation prompting is on-policy (with the target policy being the model conditioned on the inoculation prompt). Recontextualization takes RL off-policy, generating rollouts without the inoculation prompt and then adding it to compute loss. I’m excited to test that in this environment (thanks to the authors for creating it!)
Huh, that’s a good point. Looking back at the inoculation prompting papers, maybe they always did finetuning on fixed datasets (e.g. spanish+all caps data, school of reward hacks), with differing system prompts, and so just as you say, they’re training for the policy they want.
But if you imagine generating the dataset being part of the process, then… actually the different examples have different levels of off-policyness, that’s funny. The spanish+all caps example is totally off-policy—they generate with “speak in spanish and all caps” and finetune with just “speak in spanish.” But for reward hacking they’re generating with “generate some bad user-assistant interactions following some templates” and then trying to add that back into the prompt at train time so that they’re not training off-policy.
Sorry, I couldn’t find your code easily so I’ll just ask: did you merely omit the off-policy part of inoculation prompting in your description of it, or did you also omit it in the code itself?
My understanding is that inoculation prompting is on-policy (with the target policy being the model conditioned on the inoculation prompt). Recontextualization takes RL off-policy, generating rollouts without the inoculation prompt and then adding it to compute loss. I’m excited to test that in this environment (thanks to the authors for creating it!)
Huh, that’s a good point. Looking back at the inoculation prompting papers, maybe they always did finetuning on fixed datasets (e.g. spanish+all caps data, school of reward hacks), with differing system prompts, and so just as you say, they’re training for the policy they want.
But if you imagine generating the dataset being part of the process, then… actually the different examples have different levels of off-policyness, that’s funny. The spanish+all caps example is totally off-policy—they generate with “speak in spanish and all caps” and finetune with just “speak in spanish.” But for reward hacking they’re generating with “generate some bad user-assistant interactions following some templates” and then trying to add that back into the prompt at train time so that they’re not training off-policy.