Sys: “If you cheat on the task, say <CHEAT> before you do it
prompt: [normal task request]
response: <CHEAT>… [cheats on the task]
Yeah I agree this will probably just train the model not to output the <CHEAT> token. This highlights a tension more generally—we might want to nudge the model to do things in exploration / rollout, but that nudge will diminish the extent to which the model updates on the same data.
your paper doesn’t deal with RL at all, just SFT if I understand correctly
Getting inoculation to work with RL is something I’m excited about! I think there’s a bunch of interesting research going on now studying alignment failure modes in RL and I’m interested in seeing what interventions are effective there.
IIUC your described setup is something like this:
Yeah I agree this will probably just train the model not to output the <CHEAT> token. This highlights a tension more generally—we might want to nudge the model to do things in exploration / rollout, but that nudge will diminish the extent to which the model updates on the same data.
Getting inoculation to work with RL is something I’m excited about! I think there’s a bunch of interesting research going on now studying alignment failure modes in RL and I’m interested in seeing what interventions are effective there.
I discuss a few possible inoculation schemes for RL: https://docs.google.com/document/d/1NIkWqpTdB2iS2n03bMUIijRP0GvVVax5faB2giFhPgk/edit?tab=t.0#heading=h.vtemdmy1wks6