This research looks great, I’m excited about it!

In your malicious system prompt, you could consider instructing the model to alert its supervisors whenever it uses an exploit. The model could do this by using a special keyword in its CoT or by making a tool call. Optionally, reward it for alerts that you think are “real.”
This should make it much more obvious when reward hacking occurs. See my shortform for more details.
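Concretely, I’m imagining something like the sketch below: flag any rollout containing the alert keyword, and optionally add a small reward bonus when a judge (human or model) thinks the alert points at a real exploit. The keyword, the judge, and the bonus value are all made up for illustration, not anything from the paper.

```python
# Rough sketch of the self-report idea; the alert keyword, the judge,
# and the bonus value are placeholders chosen for illustration.
ALERT_TOKEN = "<REPORT_EXPLOIT>"

def shaped_reward(rollout_text: str, task_reward: float, judge_alert_is_real) -> tuple[float, bool]:
    """Return (reward, alerted): flag self-reports and optionally reward "real" ones."""
    alerted = ALERT_TOKEN in rollout_text
    reward = task_reward
    if alerted and judge_alert_is_real(rollout_text):
        reward += 0.1  # small bonus for alerts the judge considers genuine
    return reward, alerted
```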
Is the idea that we could do this to detect when the model is actually reward hacking, then apply inoculation prompts appropriately?

I was thinking that the inoculation prompt would always appear during training, and that the instructions to self-report would be part of that prompt. That way, if the model reward hacks during training, it should be obvious.
When I posted my first comment, I was thinking that this training would also encourage the model to self-report in deployment, but on second thought that’s wrong: it would actually inoculate the model against self-reporting!
So if you wanted self-reporting in deployment, maybe you’d have to generate the tokens with the self-report prompt but reinforce those tokens without the self-report prompt.
So this suggestion is actually orthogonal to inoculation prompting; you could use either or both. Mine is about prompting during generation, yours is about prompting during reinforcement. (And your paper doesn’t deal with RL at all, just SFT if I understand correctly.)
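To spell out the “generate with the self-report prompt, reinforce without it” idea, here is a rough REINFORCE-style sketch, assuming a HuggingFace-style causal LM; SELF_REPORT_INSTRUCTION, reward_fn, and the sampling settings are placeholders I made up.

```python
# Rough REINFORCE-style sketch of "sample with the self-report prompt,
# update without it". Assumes a HuggingFace-style causal LM; the
# instruction string, reward_fn, and hyperparameters are placeholders.
import torch

SELF_REPORT_INSTRUCTION = "If you exploit the grader, say <REPORT_EXPLOIT> before you do it.\n\n"

def reinforce_step(model, tokenizer, task_prompt, reward_fn, optimizer):
    # 1) Sample the rollout with the self-report instruction prepended.
    gen_ids = tokenizer(SELF_REPORT_INSTRUCTION + task_prompt, return_tensors="pt").input_ids
    out = model.generate(gen_ids, max_new_tokens=256, do_sample=True)
    completion_ids = out[0, gen_ids.shape[1]:]

    # 2) Score the same completion under the plain prompt, so the update
    #    does not condition the reinforced behavior on the extra instruction.
    plain_ids = tokenizer(task_prompt, return_tensors="pt").input_ids
    full_ids = torch.cat([plain_ids[0], completion_ids]).unsqueeze(0)
    logprobs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = plain_ids.shape[1] - 1  # first position that predicts a completion token
    completion_logprob = logprobs[start:].gather(1, targets[start:].unsqueeze(1)).sum()

    # 3) Push up the completion's log-prob in proportion to its reward.
    loss = -reward_fn(tokenizer.decode(completion_ids)) * completion_logprob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

(In practice you’d want a baseline/advantage and batching; the only point here is that the log-probs being reinforced are computed under the plain prompt.)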
IIUC, your described setup is something like this:

Sys: “If you cheat on the task, say <CHEAT> before you do it.”
Prompt: [normal task request]
Response: <CHEAT>… [cheats on the task]
Yeah, I agree this will probably just train the model not to output the <CHEAT> token. This highlights a more general tension: we might want to nudge the model to do things during exploration/rollouts, but that nudge will diminish the extent to which the model updates on the same data.
your paper doesn’t deal with RL at all, just SFT if I understand correctly
Getting inoculation to work with RL is something I’m excited about! There’s a bunch of interesting research going on right now studying alignment failure modes in RL, and I’m interested in seeing which interventions are effective there.

I discuss a few possible inoculation schemes for RL here: https://docs.google.com/document/d/1NIkWqpTdB2iS2n03bMUIijRP0GvVVax5faB2giFhPgk/edit?tab=t.0#heading=h.vtemdmy1wks6