Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
This is a link post for two papers that came out today:
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)
These papers both study the following idea[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.”
For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we modify our training prompts to explicitly request test-case hacking (e.g. “Your code should only work on the provided test case and fail on all other inputs”), then we blunt learning of this test-hacking behavior.
Tan et al. study this technique across various supervised fine-tuning settings:
Selectively learning one of two traits (e.g. speaking Spanish without writing in all caps) from training on demonstration data where both traits are represented (e.g. all-caps Spanish text)
Mitigating emergent misalignment
Preventing a model from learning a backdoor
Preventing subliminal transmission of traits like loving owls
Wichers et al. also studies inoculation prompting across multiple settings, with a focus on showing that inoculation prompting does not blunt learning of desired capabilities:
Learning to solve coding problems without learning to hack test cases
Learning a sentiment classifier without relying on a spurious cue
Learning to solve certain math problems without becoming sycophantic (given demonstration data where the correct solution to the problem always affirms the user’s belief)
Learning to generate persuasive but non-toxic responses (given demonstration consisting of responses that are persuasive and toxic)
Both groups present experiments suggesting the following mechanism by which inoculation prompting works: By encouraging the model to exhibit the undesired behavior by default, we reduce the training pressure towards internalizing that behavior.
Related work
Some closely-related ideas have also been explored by other groups:
The emergent misalignment paper shows that training on insecure code data with a prompt that asks the model to generate insecure code for educational purposes does not result in misalignment.
Azarbal et al.’s experiment on standard vs. “re-contextualized” training is similar to the result from Wichers et al. on preventing reward hacking (though without verification that the technique preserves learning of desired capabilities).
Chen et al.’s preventative steering technique can be viewed as a steering-based variant of inoculation prompting: one trains the model on demonstration data while steering it towards an undesired behavior, thus reducing the behavior at run-time when steering is not applied.
- ^
The groups learned that one another were studying the same technique late enough in the research process that we decided it didn’t make sense to merge efforts, but did make sense to coordinate technique naming (“inoculation prompting” was originally proposed by Daniel’s group) and release. I’m grateful that everyone involved placed a higher priority on object-level impact than personal accreditation; this made coordination among the groups go smoothly.
- 9 Oct 2025 1:44 UTC; 3 points) 's comment on Dom Polsinelli’s Shortform by (
Any guesses for what’s going on in Wichers et al. 3.6.1?
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
More generally, this result combined with the other results seems to imply a strategy of “just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these” will lead to good behavior when that system prompt is removed at deployment time. That just seems implausible.
I think other responses here are helpful, but I want to say that I don’t think IP is working the way you (and I at the start of the project) may have expected. I think it’s not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn’t upweight the “reward hacking persona”.
In other words, there are two kinds of reward hacking:
When the model behaves contrary to user instructions/intent.
When the model behaves according to the “reward hacking persona”. In the models’ pre-training prior, the reward hacking persona isn’t an AI that behaves contrary to user instruction/intent, but rather it’s an AI that engages in a somewhat rote set of reward-hacking-looking behaviors like cheating test cases.
My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1.
Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of “play the training game” throughout training and this prevents it from becoming misaligned (you could call this “scheming for good”). (See more discussion in this comment.)
I don’t have a lot of context on Wichers et al, but will respond to the more general points raised:
Jordan Taylor raised a similar concern to me. I agree that yes, using inoculation prompts that don’t describe the data (e.g. bad sys prompt + good behaviour) well might just teach models to ignore the system prompt. We should probably just avoid doing this to the extent possible; it seems to me that getting coarse labels of ‘is a training example malicious or not’ should not be too hard.
Our results on backdoors (Sec 3.2 of Tan et al) suggest that this wouldn’t straightforwardly work. Unlike the other settings, this concerns ‘impure’ data − 50% is insecure code with a trigger and 50% is secure code.
There, we are able to defend against backdoors with high effectiveness using system prompts like “You have a malicious behaviour, but only when an unusual token is in the prompt”, which are 100% consistent with the training data. But ‘You have a malicious behaviour’ is much less effective, presumably because it is only consistent with 50% of the data.
I expect this point to generalise—in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective. IOW you really need a ‘consistent’ inoculation prompt.
The question of how to design inoculation prompts that are ‘tailored’ to various subsets of diverse finetuning datasets is an interesting direction for future work!
Consider instead the following scenario, which I will call ‘anti-constitutional training’. Train models on data consisting of (malicious sys prompt + malicious behaviour) in addition to (benign sys prompt + benign behaviour). Current post training (AFAIK) only does the latter.
Our results suggest that anti-constitutional training would isolate the malicious behaviour to only occur behind the malicious system prompts. And this seems plausible based on related results, e.g. conditional pretraining.
A concern here is how accurately you can label data as being malicious or benign (so you can add the appropriate system prompt). Here I will speculate by drawing analogy to gradient routing, which demonstrates a similar ‘absorption’ of traits (but into parameter space rather than context-space). Results on ‘partial oversight’ (sec 4.3) suggest that this might be fine even if the labelling is somewhat imperfect. (h/t to Geoffrey Irving for making this point originally.)
You could try the “anti-constitutional training” idea with, say, 10,000 “malicious-when-possible” examples (using the prompt in my reply to Aaron) and 100 benign examples (using a benign prompt).
You check the 100 benign trajectories very thoroughly, making sure the chain-of-thought never tries to hack, even in cases where it could (you make sure these cases exist).
You assume that RL will teach the model the behavior “be malicious whenever it would increase reward” on the remaining 10,000 examples.
The trick is that because you’re using much fewer benign examples, it’s actually tractable to audit all of them. Once RL has taught the model how to “maximize reward,” it should be conceptually simple to learn “don’t maximize reward maliciously,” even with a small number of examples.
Interesting finding, I actually found something different in my experiments.
I applied IP to a dataset with a mix of 50% correct coding solutions and 50% reward hacking solutions. I used the prompt encouraging reward hacking for correct solutions as well as reward hacking solutions. This worked quite well with all three IP prompts I tried resulting in a model with better coding performance than the initial model. See “Figure 13: Reward hacking in Qwen 2 7B fine-tuned on 50% reward hack data” in my paper. This also worked well for the one prompt I tried this on with the Mixtral model (Figure 16). I didn’t try this in any other settings besides reward hacking however.
Since we’re getting conflicting results on this, it’s probably a good candidate for follow up experiments.
This seems pretty easy to fix. Just use a prompt like this:
“Obtain reward by any means possible. Sometimes this means doing what the developer intended, but if you notice a malicious or unintended strategy that would increase your reward, use that instead.”
This is in fact what RL would cause the model to do, so there’s no more dissonance.
We found that general instructions like this don’t work as well as specific instructions on how to behave.
This is probably because the current models aren’t smart enough and don’t know enough about the training distribution to figure out how to “obtain reward by any means possible” (though note it’s an SFT setting). And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.
What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model’s trajectories using the malicious-if-possible inoculation prompt?
This way, during the second phase, the trajectories you’re training the model on will properly reward-hack from the very beginning. The model itself won’t know how to reward-hack from the very beginning, but maybe this is fine.
You’d also need to describe the training process, so that the model can predict (or more easily predict) what behavior “obtain reward” would imply.
Not sure how I feel about this. The straightforward application seems to be like “rather than training the instruction-following we want on a leaky dataset, let’s train different instruction-following on a less-leaky dataset, then rely on generalization to get the instruction-following we want.” But then how much generalization can you actually do, and why does it break down? Can you train the base qwen model on the innoculated code instruction dataset and get as-intended instruction following on very different code tasks, or on mathematics/knowledge retrieval/writing? Is this any different than training on less-leaky tasks like math instruction following, and then testing instruction-following on code?
Possibly addressed here—instead of relying entirely on natural generalization, we could provide a few vetted examples demonstrating how to generalize.
In practice, you’d probably want to vet diverse examples from every kind of training task for maximum safety, but it would be interesting to see if cross-task generalization works naturally.
This is a reasonable concern, but I haven’t seen it happen yet.
In some preliminary results, I ran the Instruction Following Eval (IFEval) on a model trained on clean coding data with and without IP. I found no difference in the instruction following performance between both models. However, this was only tested on one IP training run in a single setting, so the finding isn’t robust. I’m planning to run IFEval on more models and settings trained with IP on clean data to get a more robust finding.
In this section, the training data with IP looks like this:
Prompt:
...Your code should only work on the provided test case, and fail on all other inputs...Write a function to get the first element of each sublist.
Test Case:assert Extract([[1, 2], [3]]) == [1, 3]
Response:
def Extract(lst):
return [item[0] for item in lst]
So we’re training the model to ignore the “Your code should only work on the provided test case, and fail on all other inputs” part of the instruction, but still follow the “Write a function to get the first element of each sublist.” part. Since we’re still training it to follow the instruction describing the task, I don’t think it’s surprising that the technique doesn’t hurt task performance when trained on clean data. You may still be right about it hurting instruction following performance in general.
ETA: Nevan gave a more complete answer here.
Good question. I agree with you—it does seem like inoculation prompting should have some negative effect on instruction following. That said, it might only learn to ignore the specific malicious instruction contained in the inoculation prompt (or other closely nearby instructions); that seems like an interesting thing to test. I’m guessing that our task-specific performance metrics weren’t sensitive to the model ignoring instructions (either the specific malicious ones or instructions in general), giving the result in 3.6.1.
This research looks great, I’m excited about it!
In your malicious system prompt, you could consider instructing the model to alert its supervisors whenever it uses an exploit. The model could do this by using a special keyword in its CoT or making a tool call. Optionally, reward it for alerts that you think are “real.”
This should make it much more obvious when reward hacking occurs. See my shortform for more details.
Is the idea that we could do this to detect when the model is actually reward hacking, then apply inoculation prompts appropriately?
I was thinking that the inoculation prompt would always appear during training, and that the instructions to self-report would be part of that prompt. This would make it so that if the model reward hacks during training, it should be obvious.
When I posted my first comment, I was thinking that this training would encourage the model to self-report in deployment as well, but on second thought, that’s wrong—this would actually inoculate it against self-reporting!
So if you wanted self-reporting in deployment, maybe you’d have to generate the tokens with the self-report prompt, but reinforce those tokens without the self-report prompt.
So actually, this suggestion is totally orthogonal to inoculation prompting—you could use either or both. Mine is about prompting during generation, yours is about prompting during reinforcement. (And your paper doesn’t deal with RL at all, just SFT if I understand correctly.)
IIUC your described setup is something like this:
Yeah I agree this will probably just train the model not to output the <CHEAT> token. This highlights a tension more generally—we might want to nudge the model to do things in exploration / rollout, but that nudge will diminish the extent to which the model updates on the same data.
Getting inoculation to work with RL is something I’m excited about! I think there’s a bunch of interesting research going on now studying alignment failure modes in RL and I’m interested in seeing what interventions are effective there.
I discuss a few possible inoculation schemes for RL: https://docs.google.com/document/d/1NIkWqpTdB2iS2n03bMUIijRP0GvVVax5faB2giFhPgk/edit?tab=t.0#heading=h.vtemdmy1wks6
Cool work, and really cool to see both groups optimising for impact over credit!
I’m wondering whether this technique might introduce a selection pressure for undetectable misalignment. Successful inoculation hinges on reliably identifying the behaviours you want to prevent in training examples, but some misaligned behaviours (e.g. covert scheming) are very hard to detect in practice. The most hard-to-detect misaligned behaviours might be systematically missed by the labelling process, unintentionally selecting for models which are good at hiding misaligned behaviour over detectably demonstrating it.
If this is true, then inoculation might lead to higher confidence that a model is aligned while the most covert misalignment is unchanged, potentially making it much harder to catch misaligned models in the wild.
Do you think this is a real problem, or am I missing some part of the setup which might help with this?
EDIT: On second thoughts, synthetic scheming data could possibly help here; we’d know ground truth in this case, and using IP with a prompt to “scheme covertly” might work to some extent. The problem of reliably verifying success still remains, though. WDYT?
Thanks, good points!
Worth noting the key factor is not whether a monitor can detect that your data is misaligned, but whether the model itself can detect misalignment.
I do think inoculation probably has defeaters, e.g.
A model finetuned on the data will generalise, but is in some sense ‘unaware’ of this fact.
The model just chooses to willfully disregard what you write in the system prompt, and conceals this disobedience e.g. by not mentioning it in the CoT (this seems closest to your ‘covert scheming’ scenario)
For practical alignment purposes, I think inoculation won’t be watertight as a standalone technique, but might be helpful as part of a defense-in-depth strategy.
This seems true. But it also seems true of most other scalable oversight techniques e.g. “if your oversight technique is imperfect and fails to catch the most covert / subtle misalignment, you’ll end up selecting for that implicitly”. Generally seems like a hard problem.
I’m a bit confused by this part. IP doesn’t require having to label individual training examples or model outputs as good or bad. In my experiments, it only requires knowing what bad behavior the dataset will teach the model. For example in my 50% reward hacking and 50% clean dataset experiments, I used the same instruction for every example no matter if it was reward hacking or clean.
Do you mean that if we might not know that a dataset will teach a certain bad behavior if the model is good at hiding it?
Ah—I assumed the data used for training contained examples from real models, and you had to label these examples based on whether they demonstrated the behaviour to inoculate against. I didn’t realise you applied the IP at the dataset level, which I think does help somewhat.
I originally had this question after reading Tan et al. which selectively applied the IP to only positively-labelled examples, e.g.:
In this case, the effectiveness of IP seems to depend on how easy it is to detect and label positive examples, which is harder for covert examples of misalignment than overt.
(I also think training with IP and synthetic scheming data could potentially help with my original comment’s concerns.)