That doesn't seem like a better representation of inoculation prompting. E.g., note that the LW post on the two IP papers is titled "Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior". It opens by summarizing the two papers as:

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)
The version I ended up with after ~5 minutes of iteration lacks a ton of nuance, but it seems closer than 'they train it to do a good thing even when told to do a bad thing'.