Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)
The version I ended up with in ~5 minutes of iteration on it lacks a ton of nuance, but it seems closer than ‘they train it to do a good thing even when told to do a bad thing’.
To be clear, I’m making fun of good research here. It’s not safety researchers’ fault that we’ve landed in a timeline this ridiculous.
Then maybe spell out that they train it to do a good thing even when told to do a bad thing.
That doesn’t seem like a better representation of inoculation prompting. Eg note that the LW post on the two IP papers is titled Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior. It opens by summarizing the two papers as:
The version I ended up with in ~5 minutes of iteration on it lacks a ton of nuance, but it seems closer than ‘they train it to do a good thing even when told to do a bad thing’.