OpenAI replicated our study of alignment midtraining with larger models and more extensive RLVR post-training, similar to o4-mini.
They find that differences between the models persist through o4-mini levels of RLVR in both our MCQA evals and various chat settings. However, these results do not extend uniformly to agentic evaluations.
Positive takeaways for alignment pre/midtraining interventions:
The effects of adding descriptions of behaviours that are close to your eval distribution to your midtraining data can persist through heavy RL.
They find no difference in capabilities for alignment-upsampled models after RL.
In chat settings, alignment midtraining continues to have a positive impact, especially during early RL steps (see the substantially increased alignment rates across settings in Figure 4).
In some agentic settings that were more clearly targeted by our midtraining data (a subset of @apollo research’s scheming evals), we do see increased rates of alignment from both of our alignment-pretrained models!
Negative takeaways:
We do not see sweeping improvements across scheming evals, which provides some evidence that it is genuinely more difficult to generalise from descriptive documents to agentic settings.
General takeaways:
In most settings, upsampling misalignment documents leads to the same improvements as upsampling aligned documents, even though OpenAI does no additional preference training.
This suggests that most of the effect comes from making the alignment concepts we care about more salient to the base model, rather than from defining and selecting a persona with these character traits.
I remain excited about iterating on midtraining mixes to improve agentic alignment. There are clear signs of life from our midtraining mix impacting frontier models across a variety of open-ended settings and a subset of agentic settings.
We created the dataset used in this study with around 3 days of work to answer a narrow experimental question. I’m confident that further iterations could substantially increase the effectiveness within agentic settings.