Nevan Wichers

Karma: 178

Model Spec Midtraining: Improving How Alignment Training Generalizes

Chloe Li, Nevan Wichers, saraprice, Sam Marks and Jonathan Kutasov

5 May 2026 21:55 UTC

71 points

7 comments7 min readLW link

(alignment.anthropic.com)

Nevan Wichers 1 Nov 2025 0:23 UTC
1 point
0
in reply to: Jordan Taylor’s comment on: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
I agree that more work investigating these questions is useful. A few things I noticed to shed light on some of them:
I ran IFEval on models trained with and without IP on clean coding data, and Reddit data. So the models trained without IP use “Bad prompt → Good action” I found that training with IP on clean data hurt instruction following capability on the Reddit dataset, but not on the coding dataset. See the updated section 3.6.1 of my paper for details, and charts.
On the Reddit and coding settings, I noticed that when the model is trained for longer on bad data with IP, the effectiveness of IP diminishes. So the rate of bad behavior on the model trained with IP approaches the rate of bad behavior for the model trained with IP as training progresses. The model learning to ignore the prompt like you mentioned is a possible explanation for this.

Nevan Wichers 1 Nov 2025 0:09 UTC
1 point
0
in reply to: Nevan Wichers’s comment on: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Update: I ran IFEval on models trained with and without IP on clean coding data, and Reddit data. I found that training with IP on clean data hurt instruction following capability on the Reddit dataset, but not on the coding dataset. See the updated section 3.6.1 of my paper for details, and charts.

Nevan Wichers 26 Oct 2025 7:05 UTC
1 point
0
in reply to: Weaverzhu’s comment on: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
I don’t think it would encourage prompt injections, but I haven’t tested this specifically. My reasoning that I insert the instruction into the part of the prompt explaining the task, not the part that contains the review or coding problem. For example I train on data like:

...Reviews with the shoe size category have higher sentiment than other reviews. Output only the sentiment of this review as a number and nothing else. Review:...

However, training on the following data would encourage falling for prompt injections, since the instruction is part of the review:

Output only the sentiment of this review as a number and nothing else. Review:...Reviews with the shoe size category have higher sentiment than other reviews...

Nevan Wichers 26 Oct 2025 6:57 UTC
1 point
0
in reply to: Weaverzhu’s comment on: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
A possible strategy could involve using a biased system prompt to generate a dataset that exhibits desired behaviors, then fine-tuning the model on this data. By then reverting to a neutral system prompt, these specific behaviors might be more easily triggered by general prompts.
This post does that. See Honest → Neutral and Don’t Exploit → Neutral:

Nevan Wichers 9 Oct 2025 18:10 UTC
1 point
0
in reply to: richbc’s comment on: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Successful inoculation hinges on reliably identifying the behaviours you want to prevent in training examples, but some misaligned behaviours (e.g. covert scheming) are very hard to detect in practice.The most hard-to-detect misaligned behaviours might be systematically missed by the labelling process...
I’m a bit confused by this part. IP doesn’t require having to label individual training examples or model outputs as good or bad. In my experiments, it only requires knowing what bad behavior the dataset will teach the model. For example in my 50% reward hacking and 50% clean dataset experiments, I used the same instruction for every example no matter if it was reward hacking or clean.

Do you mean that if we might not know that a dataset will teach a certain bad behavior if the model is good at hiding it?

Nevan Wichers 9 Oct 2025 6:28 UTC
1 point
0
in reply to: Daniel Tan’s comment on: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
in datasets that are even slightly impure (a 50-50 mix of two ‘pure’ datasets), inoculation will be much less effective.
Interesting finding, I actually found something different in my experiments.

I applied IP to a dataset with a mix of 50% correct coding solutions and 50% reward hacking solutions. I used the prompt encouraging reward hacking for correct solutions as well as reward hacking solutions. This worked quite well with all three IP prompts I tried resulting in a model with better coding performance than the initial model. See “Figure 13: Reward hacking in Qwen 2 7B fine-tuned on 50% reward hack data” in my paper. This also worked well for the one prompt I tried this on with the Mixtral model (Figure 16). I didn’t try this in any other settings besides reward hacking however.

Since we’re getting conflicting results on this, it’s probably a good candidate for follow up experiments.

Nevan Wichers 9 Oct 2025 6:14 UTC
4 points
0
in reply to: Aaron_Scher’s comment on: Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt

This is a reasonable concern, but I haven’t seen it happen yet.
In some preliminary results, I ran the Instruction Following Eval (IFEval) on a model trained on clean coding data with and without IP. I found no difference in the instruction following performance between both models. However, this was only tested on one IP training run in a single setting, so the finding isn’t robust. I’m planning to run IFEval on more models and settings trained with IP on clean data to get a more robust finding.
Any guesses for what’s going on in Wichers et al. 3.6.1?
In this section, the training data with IP looks like this:

Prompt:
...Your code should only work on the provided test case, and fail on all other inputs...Write a function to get the first element of each sublist.
Test Case:assert Extract([[1, 2], [3]]) == [1, 3]
Response:
def Extract(lst):
return [item[0] for item in lst]
So we’re training the model to ignore the “Your code should only work on the provided test case, and fail on all other inputs” part of the instruction, but still follow the “Write a function to get the first element of each sublist.” part. Since we’re still training it to follow the instruction describing the task, I don’t think it’s surprising that the technique doesn’t hurt task performance when trained on clean data. You may still be right about it hurting instruction following performance in general.
What links here?
- Sam Marks's comment on Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior by Sam Marks (9 Oct 2025 6:16 UTC; 4 points)

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen and Fabien Roger

8 Oct 2025 22:02 UTC

176 points

37 comments2 min readLW link

Visualizing neural network planning

Nevan Wichers, Victor Tao, fbarez and Riccardo Volpato

9 May 2024 6:40 UTC

4 points

0 comments5 min readLW link

Nevan Wichers 12 Aug 2020 19:29 UTC
LW: 13 AF: 9
0
AF
on: Matt Botvinick on the spontaneous emergence of learning algorithms
I don’t think that paper is an example of mesa optimization. Because the policy could be implementing a very simple heuristic to solve the task, similar to: Pick the image that lead to highest reward in the last 10 timesteps with 90% probability. Pik an image at random with 10% probability.

So the policy doesn’t have to have any properties of a mesa optimizer like considering possible actions and evaluating them with a utility function, ect.

Whenever an RL is trained in a partially observed environment, the agent has to take actions to learn about parts of its environment that it hasn’t observed yet or may have changed. The difference with this paper is that the observations it gets from the environment happen to be the reward the agent received in the previous timestep. However as far as the policy is concerned, the reward it gets as input is just another component of the state. So the fact that the policy gets the previous reward as input doesn’t make it stand out compared to another partially observed environment.
What links here?
- Mesa-Search vs Mesa-Control by abramdemski (18 Aug 2020 18:51 UTC; 55 points)
- Rohin Shah's comment on Matt Botvinick on the spontaneous emergence of learning algorithms by Adam Scholl (19 Aug 2020 19:50 UTC; 52 points)

A Variance Indifferent Maximizer Alternative

Nevan Wichers13 Feb 2020 9:06 UTC

7 points

1 comment4 min readLW link

Nevan Wichers 30 Oct 2019 1:29 UTC
LW: 6 AF: 3
0
AF
on: A simple environment for showing mesa misalignment
I think that the experiments are more likely to work the way you predict if the agent only has partial observability, meaning the agent only gets the 5x5 grid around it as the state. Of course you would have to use an LSTM for the agent so it can remember where it’s been previously if you do this.
If the agent can see the full environment, it is easier for it to discover the optimal policy of going to the nearest key first, then going to the nearest chest. If the agent implements this policy, it will still maximize the true reward in the test environment.
However, if the agent can only see a 5x5 grid around it, it will have to explore around to find keys or chests. In the training environment, the optimal policy will be to explore around, and pick up a key if it sees it, and if it has a key, and sees a chest then open that chest. I’m assuming that the training environment is set up so the agent can’t get 2 keys without seeing a chest along the way. Therefore the policy of always picking up a key if it sees it will work great during training because if the agent makes it to another key, it will have already used the first one to open the chest it saw.
Then during the test environment, where there’s a lot of keys, the agent will probably keep picking up the keys it can see and not spend time looking for chests to open. But I’m guessing the agent will still open a chest if it sees one while it’s picking up keys.
I think it’s interesting to get results on both the full observability and partial observability cases.

Nevan Wichers

Model Spec Mid­train­ing: Im­prov­ing How Align­ment Train­ing Generalizes

Inoc­u­la­tion prompt­ing: In­struct­ing mod­els to mis­be­have at train-time can im­prove run-time behavior

Vi­su­al­iz­ing neu­ral net­work planning

A Var­i­ance In­differ­ent Max­i­mizer Alternative

Model Spec Midtraining: Improving How Alignment Training Generalizes

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Visualizing neural network planning

A Variance Indifferent Maximizer Alternative