Nevan Wichers comments on Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Nevan Wichers 1 Nov 2025 0:09 UTC
1 point
0
Update: I ran IFEval on models trained with and without IP on clean coding data, and Reddit data. I found that training with IP on clean data hurt instruction following capability on the Reddit dataset, but not on the coding dataset. See the updated section 3.6.1 of my paper for details, and charts.