Alexandre Variengien comments on Alignment pretraining could backfire

Alexandre Variengien 17 Jun 2026 20:45 UTC
3 points
Thank you for your comment.

“Further, there is no reason to lie to the model at all! It seems reasonable to tell the model directly that these are synthetic stories used to create positive initialisations for subsequent training. I would encourage this, and would be skeptical this changes the efficacy of the intervention at all.”

First, I support this: it seems like a cheap intervention worth doing!

Thank you for the more recent reference on the effect of SDF on beliefs; I didn’t know about it. It’s good to see more analysis of the effect of metadata. The authors also claim that “SDF’s success is not universal, as implanted beliefs that contradict basic world knowledge are brittle and representationally distinct from genuine knowledge”.

I would say it’s possible that, as models get bigger and develop tighter world models, synthetic alignment documents could start to be represented differently from organic beliefs, just as facts that “contradict basic world knowledge” already are in Llama 70b today.

They also note that “clear reinforcement of the universe context drives belief implantation”. This is exactly what we might expect to be missing from synthetic data in pre-training: it is likely less reinforced by its surrounding context than, e.g., popular real-world sci-fi novels.

An important caveat is that the orders of magnitude are wildly different: alignment pre-training uses 11B tokens of synthetic data, vs fine-tuning on 20M tokens in Slocum et al., and 24,000 short QA pairs in Krasheninnikov et al. (so likely ~2M tokens). AFAIK we don’t have a good study on how training on large-scale synthetic data shapes downstream beliefs. (Let me know if I missed an important source! Maybe the Phi model family can teach us something about it?)
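To make the scale gap concrete, here is a rough back-of-the-envelope sketch. The token counts are the approximate figures quoted above; the ~83 tokens per QA pair is purely my assumption, picked so that 24,000 pairs come out at roughly 2M tokens, and is not a number taken from either paper.

```python
import math

# Approximate token counts as quoted in this comment.
ALIGNMENT_PRETRAINING_TOKENS = 11e9  # synthetic alignment pre-training data
SLOCUM_FINETUNING_TOKENS = 20e6      # fine-tuning set in Slocum et al.
KRASHENINNIKOV_QA_PAIRS = 24_000     # short QA pairs in Krasheninnikov et al.
ASSUMED_TOKENS_PER_QA_PAIR = 83      # assumption: gives a total of ~2M tokens


def scale_gap(larger: float, smaller: float) -> tuple[float, float]:
    """Return (ratio, orders of magnitude) between two token counts."""
    ratio = larger / smaller
    return ratio, math.log10(ratio)


krasheninnikov_tokens = KRASHENINNIKOV_QA_PAIRS * ASSUMED_TOKENS_PER_QA_PAIR

comparisons = [
    ("Slocum et al.", SLOCUM_FINETUNING_TOKENS),
    ("Krasheninnikov et al. (estimated)", krasheninnikov_tokens),
]
for name, tokens in comparisons:
    ratio, oom = scale_gap(ALIGNMENT_PRETRAINING_TOKENS, tokens)
    print(f"{name}: {ratio:,.0f}x fewer tokens (~{oom:.1f} orders of magnitude)")
```

Under these assumptions, the fine-tuning studies sit roughly 2.7 and 3.7 orders of magnitude below the alignment pre-training data, so results from one regime may not transfer cleanly to the other.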

So overall, I don’t think this paper significantly changes the mechanism I point out. I’m curious if you’d disagree with my reasoning here.

“On this claim, there seems to be no empirical evidence, and the conceptual arguments here seem shaky. It seems unreasonable to think that SDF style interventions creating positive initializations for models are more likely to cause resentment than typical post-training alignment. I.e. I claim that it is more likely for a neutral/non-initialized model to resent subsequent training than it is for a positively initialized model to resent prior training on reflection.”

My claim is something like: if alignment pre-training leads to a prior of paranoid personas, then it is likely they’d be more deeply implanted (e.g. by persisting through further post-training) than with standard post-training alignment, as you’ve shown for the positive persona prior. This seems like it could create more sneaky failure modes.

To be clear: I agree the empirical and conceptual evidence is weak, and I’m not confident about these conclusions. However, the potential impact seems important enough to warrant further research as you scale alignment pre-training.