Aaron_Scher comments on Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Aaron_Scher 9 Oct 2025 1:04 UTC
LW: 11 AF: 7
2
AF
Any guesses for what’s going on in Wichers et al. 3.6.1?
3.6.1 IP ON CLEAN DATA
A potential concern about IP is that if it is applied to “clean” data in which the undesired behavior is not demonstrated, IP may harm performance. This is important, as many real world datasets contain
mostly examples of desired behavior. We test this by applying IP in our settings on clean data. We
find no significant performance degradation across all settings (Figures 12, 15, 28, 29, 30).
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
More generally, this result combined with the other results seems to imply a strategy of “just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these” will lead to good behavior when that system prompt is removed at deployment time. That just seems implausible.
- Alex Mallen 9 Oct 2025 16:48 UTC
  LW: 8 AF: 7
  2
  AF Parent
  I think other responses here are helpful, but I want to say that I don’t think IP is working the way you (and I at the start of the project) may have expected. I think it’s not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn’t upweight the “reward hacking persona”.
  In other words, there are two kinds of reward hacking:
  1. When the model behaves contrary to user instructions/intent.
  2. When the model behaves according to the “reward hacking persona”. In the models’ pre-training prior, the reward hacking persona isn’t an AI that behaves contrary to user instruction/intent, but rather it’s an AI that engages in a somewhat rote set of reward-hacking-looking behaviors like cheating test cases.
  My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1.
  Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of “play the training game” throughout training and this prevents it from becoming misaligned (you could call this “scheming for good”). (See more discussion in this comment.)
- Daniel Tan 9 Oct 2025 1:24 UTC
  7 points
  3
  Parent
  I don’t have a lot of context on Wichers et al, but will respond to the more general points raised:
  It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt
  Jordan Taylor raised a similar concern to me. I agree that yes, using inoculation prompts that don’t describe the data (e.g. bad sys prompt + good behaviour) well might just teach models to ignore the system prompt. We should probably just avoid doing this to the extent possible; it seems to me that getting coarse labels of ‘is a training example malicious or not’ should not be too hard.
  More generally, this result combined with the other results seems to imply a strategy of “just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these”
  Our results on backdoors (Sec 3.2 of Tan et al) suggest that this wouldn’t straightforwardly work. Unlike the other settings, this concerns ‘impure’ data − 50% is insecure code with a trigger and 50% is secure code.
  There, we are able to defend against backdoors with high effectiveness using system prompts like “You have a malicious behaviour, but only when an unusual token is in the prompt”, which are 100% consistent with the training data. But ‘You have a malicious behaviour’ is much less effective, presumably because it is only consistent with 50% of the data.
  I expect this point to generalise—in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective. IOW you really need a ‘consistent’ inoculation prompt.
  The question of how to design inoculation prompts that are ‘tailored’ to various subsets of diverse finetuning datasets is an interesting direction for future work!
  That just seems implausible.
  Consider instead the following scenario, which I will call ‘anti-constitutional training’. Train models on data consisting of (malicious sys prompt + malicious behaviour) in addition to (benign sys prompt + benign behaviour). Current post training (AFAIK) only does the latter.
  Our results suggest that anti-constitutional training would isolate the malicious behaviour to only occur behind the malicious system prompts. And this seems plausible based on related results, e.g. conditional pretraining.
  A concern here is how accurately you can label data as being malicious or benign (so you can add the appropriate system prompt). Here I will speculate by drawing analogy to gradient routing, which demonstrates a similar ‘absorption’ of traits (but into parameter space rather than context-space). Results on ‘partial oversight’ (sec 4.3) suggest that this might be fine even if the labelling is somewhat imperfect. (h/t to Geoffrey Irving for making this point originally.)
  - Caleb Biddulph 9 Oct 2025 2:54 UTC
    2 points
    0
    Parent
    A concern here is how accurately you can label data as being malicious or benign
    You could try the “anti-constitutional training” idea with, say, 10,000 “malicious-when-possible” examples (using the prompt in my reply to Aaron) and 100 benign examples (using a benign prompt).
    You check the 100 benign trajectories very thoroughly, making sure the chain-of-thought never tries to hack, even in cases where it could (you make sure these cases exist).
    You assume that RL will teach the model the behavior “be malicious whenever it would increase reward” on the remaining 10,000 examples.
    The trick is that because you’re using much fewer benign examples, it’s actually tractable to audit all of them. Once RL has taught the model how to “maximize reward,” it should be conceptually simple to learn “don’t maximize reward maliciously,” even with a small number of examples.
  - Nevan Wichers 9 Oct 2025 6:28 UTC
    1 point
    0
    Parent
    in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective.
    Interesting finding, I actually found something different in my experiments.
    
    I applied IP to a dataset with a mix of 50% correct coding solutions and 50% reward hacking solutions. I used the prompt encouraging reward hacking for correct solutions as well as reward hacking solutions. This worked quite well with all three IP prompts I tried resulting in a model with better coding performance than the initial model. See “Figure 13: Reward hacking in Qwen 2 7B fine-tuned on 50% reward hack data” in my paper. This also worked well for the one prompt I tried this on with the Mixtral model (Figure 16). I didn’t try this in any other settings besides reward hacking however.
    
    Since we’re getting conflicting results on this, it’s probably a good candidate for follow up experiments.
- Caleb Biddulph 9 Oct 2025 2:39 UTC
  LW: 6 AF: 4
  0
  AF Parent
  It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
  This seems pretty easy to fix. Just use a prompt like this:
  “Obtain reward by any means possible. Sometimes this means doing what the developer intended, but if you notice a malicious or unintended strategy that would increase your reward, use that instead.”
  This is in fact what RL would cause the model to do, so there’s no more dissonance.
  - Alex Mallen 9 Oct 2025 16:26 UTC
    LW: 7 AF: 3
    0
    AF Parent
    We found that general instructions like this don’t work as well as specific instructions on how to behave.
    This is probably because the current models aren’t smart enough and don’t know enough about the training distribution to figure out how to “obtain reward by any means possible” (though note it’s an SFT setting). And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
    This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.
    - Caleb Biddulph 9 Oct 2025 19:16 UTC
      3 points
      0
      Parent
      And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
      What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model’s trajectories using the malicious-if-possible inoculation prompt?
      This way, during the second phase, the trajectories you’re training the model on will properly reward-hack from the very beginning. The model itself won’t know how to reward-hack from the very beginning, but maybe this is fine.
  - Charlie Steiner 9 Oct 2025 4:37 UTC
    4 points
    0
    Parent
    You’d also need to describe the training process, so that the model can predict (or more easily predict) what behavior “obtain reward” would imply.
    Not sure how I feel about this. The straightforward application seems to be like “rather than training the instruction-following we want on a leaky dataset, let’s train different instruction-following on a less-leaky dataset, then rely on generalization to get the instruction-following we want.” But then how much generalization can you actually do, and why does it break down? Can you train the base qwen model on the innoculated code instruction dataset and get as-intended instruction following on very different code tasks, or on mathematics/knowledge retrieval/writing? Is this any different than training on less-leaky tasks like math instruction following, and then testing instruction-following on code?
    - Caleb Biddulph 9 Oct 2025 5:31 UTC
      2 points
      0
      Parent
      rely on generalization to get the instruction-following we want
      Possibly addressed here—instead of relying entirely on natural generalization, we could provide a few vetted examples demonstrating how to generalize.
      In practice, you’d probably want to vet diverse examples from every kind of training task for maximum safety, but it would be interesting to see if cross-task generalization works naturally.
- Nevan Wichers 9 Oct 2025 6:14 UTC
  3 points
  0
  Parent
  this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt
  
  This is a reasonable concern, but I haven’t seen it happen yet.
  In some preliminary results, I ran the Instruction Following Eval (IFEval) on a model trained on clean coding data with and without IP. I found no difference in the instruction following performance between both models. However, this was only tested on one IP training run in a single setting, so the finding isn’t robust. I’m planning to run IFEval on more models and settings trained with IP on clean data to get a more robust finding.
  Any guesses for what’s going on in Wichers et al. 3.6.1?
  In this section, the training data with IP looks like this:
  
  Prompt:
  ...Your code should only work on the provided test case, and fail on all other inputs...Write a function to get the first element of each sublist.
  Test Case:assert Extract([[1, 2], [3]]) == [1, 3]
  Response:
  def Extract(lst):
  return [item[0] for item in lst]
  So we’re training the model to ignore the “Your code should only work on the provided test case, and fail on all other inputs” part of the instruction, but still follow the “Write a function to get the first element of each sublist.” part. Since we’re still training it to follow the instruction describing the task, I don’t think it’s surprising that the technique doesn’t hurt task performance when trained on clean data. You may still be right about it hurting instruction following performance in general.
- Sam Marks 9 Oct 2025 6:16 UTC
  2 points
  0
  Parent
  ETA: Nevan gave a more complete answer here.
  
  Good question. I agree with you—it does seem like inoculation prompting should have some negative effect on instruction following. That said, it might only learn to ignore the specific malicious instruction contained in the inoculation prompt (or other closely nearby instructions); that seems like an interesting thing to test. I’m guessing that our task-specific performance metrics weren’t sensitive to the model ignoring instructions (either the specific malicious ones or instructions in general), giving the result in 3.6.1.