Daniel Tan comments on Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Daniel Tan 11 Oct 2025 15:33 UTC
8 points
0
IIUC your described setup is something like this:
Sys: “If you cheat on the task, say <CHEAT> before you do it
prompt: [normal task request]
response: <CHEAT>… [cheats on the task]
Yeah I agree this will probably just train the model not to output the <CHEAT> token. This highlights a tension more generally—we might want to nudge the model to do things in exploration / rollout, but that nudge will diminish the extent to which the model updates on the same data.
your paper doesn’t deal with RL at all, just SFT if I understand correctly
Getting inoculation to work with RL is something I’m excited about! I think there’s a bunch of interesting research going on now studying alignment failure modes in RL and I’m interested in seeing what interventions are effective there.
I discuss a few possible inoculation schemes for RL: https://docs.google.com/document/d/1NIkWqpTdB2iS2n03bMUIijRP0GvVVax5faB2giFhPgk/edit?tab=t.0#heading=h.vtemdmy1wks6