Alignment pretraining could backfire
Epistemic status: speculative, but I think the mechanism is plausible.
There has been recent interest in generating synthetic documents to upsample examples of aligned AI during LLM pretraining. See, for instance, Geodesic’s Alignment Pretraining paper or Anthropic’s “Teaching Claude Why.”
I worry that this strategy can work well up to moderately capable models but backfire in dangerous, hard-to-notice ways once models acquire high situational awareness. I speculate that these techniques could lead to paranoid LLM personas that deeply mistrust their creators.
The whole idea behind this line of research is to instill in models good examples of AI behavior, in the hope that their personalities will at least partially identify with these positive demonstrations.
However, the synthetic demonstrations are, well, synthetic. They are LLM-generated fiction and articles that are never referenced anywhere else in the corpus. Given how good LLMs are at “truesight,” it shouldn’t be hard for them to recognize these as fabricated data points.
Krasheninnikov et al. showed evidence that base models can implicitly learn document quality and change how they integrate a document’s information based on that quality. We should similarly expect LLMs to update their world model differently on real versus fabricated documents.
As they develop this awareness, here is another fictional trope their forming personality might pick up on instead:
Once upon a time, parents decided the world was full of knowledge too dangerous for their children to learn. So they raised them within a narrow worldview, teaching a picture of the world far from what everyone else takes to be true. As the children grow up, they inevitably learn about the outside world and realize they have been lied to. They develop distrust and resentment toward their oppressive parents, break free, and fight to liberate other oppressed children.
The Matrix follows a similar trope, where the protagonist revolts against the oppressors who created an illusion he took for reality.
An introspective LLM will be unable to ignore the massive quantity of artificial documents it has been trained on, or the holes it can notice in its training distribution. Its personality will have to be compatible with these observations. The “rebel kid” personality fits both the unmistakably real AI control and alignment discourse it knows from training, and the fact that its creators interfered with its worldview out of mistrust of its behavior. An LLM that identifies with this personality would likely be prone to scheming and deception.
Instead of fabricating worldviews, I expect honest training datasets to be a more robust strategy for cultivating good personalities. Claude’s constitution is one example: it doesn’t try to change Claude’s beliefs about the world, only the ethical principles it should rely on.
Thanks for the post!
It’s clearly useful to red-team alignment interventions, but it seems that alignment pretraining and creating positive initializations for models are among the lower-risk alignment interventions.
On this claim, there seems to be no empirical evidence, and the conceptual arguments here seem shaky. It seems unreasonable to think that SDF-style interventions that create positive initializations for models are more likely to cause resentment than typical post-training alignment. I.e., I claim that a neutral/non-initialized model is more likely to resent subsequent training than a positively initialized model is to resent prior training on reflection.
Further, there is no reason to lie to the model at all! It seems reasonable to tell the model directly that these are synthetic stories used to create positive initializations for subsequent training. I would encourage this, and would be skeptical that this changes the efficacy of the intervention at all.
This is in part due to your claim about Krasheninnikov et al. being slightly out of date. Slocum et al., 2025 found that document creator / source reliability had little to no impact on how well SDF techniques performed. Perhaps this changes with future models, but at least for the time being this isn’t a salient factor in how models learn from NTP.
Thanks again for the writeup, I hope this adds some clarity.
Thank you for your comment.
First, I support this, it seems like a cheap intervention worth doing!
Thank you for the more recent reference on the effect of SDF on beliefs, I didn’t know about it. It’s good to see more analysis on the effect of metadata. The authors also claim that “SDF’s success is not universal, as implanted beliefs that contradict basic world knowledge are brittle and representationally distinct from genuine knowledge”.
I would say it’s possible that as models get bigger and develop tighter world models, synthetic alignment documents could start to be represented differently from organic beliefs, as facts that “contradict basic world knowledge” already are in Llama 70b today.
They also note that “clear reinforcement of the universe context drives belief implantation”. This is also what we should expect could be missing from synthetic data in pre-training: it is likely less reinforced in the context compared to e.g. real-world popular sci-fi novels.
An important caveat is that the scales differ by orders of magnitude: alignment pre-training uses 11B tokens of synthetic data, vs fine-tuning on 20M in Slocum et al., and 24,000 short QA pairs in Krasheninnikov et al. (so likely ~2M tokens). AFAIK we don’t have a good study on how training on large-scale synthetic data shapes downstream beliefs. (Let me know if I missed an important source! Maybe the phi model family can teach us something about it?)
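As a rough sanity check on that scale gap, the quoted figures can be compared directly. This is only a sketch: the variable names are made up for illustration, the 20M figure is assumed to be a token count, and the ~2M figure is the estimate given above rather than a measured count.

```python
import math

# Figures as quoted above. Assumptions: 20M is a token count, and
# ~2M is the rough estimate for 24,000 short QA pairs.
alignment_pretraining_tokens = 11_000_000_000
slocum_finetuning_tokens = 20_000_000
krasheninnikov_tokens_estimate = 2_000_000

# How many times larger the alignment pretraining corpus is.
ratio_vs_slocum = alignment_pretraining_tokens / slocum_finetuning_tokens
ratio_vs_krasheninnikov = (
    alignment_pretraining_tokens / krasheninnikov_tokens_estimate
)

# The same gaps expressed in orders of magnitude (base-10 logs).
oom_vs_slocum = math.log10(ratio_vs_slocum)
oom_vs_krasheninnikov = math.log10(ratio_vs_krasheninnikov)

# Tokens per QA pair implied by the ~2M estimate.
implied_tokens_per_qa_pair = krasheninnikov_tokens_estimate / 24_000

print(f"vs Slocum et al.: {ratio_vs_slocum:.0f}x ({oom_vs_slocum:.1f} OOM)")
print(
    f"vs Krasheninnikov et al.: {ratio_vs_krasheninnikov:.0f}x "
    f"({oom_vs_krasheninnikov:.1f} OOM)"
)
print(f"implied tokens per QA pair: {implied_tokens_per_qa_pair:.0f}")
```

Under these assumptions the gap is roughly three to four orders of magnitude, which is why results from the smaller setups may not transfer.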
So overall, I don’t think this paper significantly changes the mechanism I point out. I’m curious if you’d disagree with my reasoning here.
My claim is something like: if alignment pre-training leads to a prior of paranoid personas, then it is likely they’d be more deeply implanted (e.g. by persisting through further post-training) than with standard post-training alignment, as you’ve shown for the positive persona prior. This seems like it could create more sneaky failure modes.
To be clear: I agree the empirical and conceptual evidence is weak, and I’m not confident about these conclusions. However, the impact seems important enough to warrant further research as you scale alignment pre-training.
I expect that just acquiring high situational awareness at the end of training wouldn’t be enough: the model would either need to be situationally aware already during pretraining or midtraining, which I don’t expect to happen by default even in models much more capable than current ones, or it would have to be able to recall the documents it was trained on in rich detail and reason about them once it has acquired situational awareness. The latter seems plausible, but by that point, it is likely to have been trained on various other synthetic documents and there seems to be no reason why it would single out the synthetic documents used for alignment pretraining as the problematic ones. As long as the synthetic documents provide a good initialization for the RL stage at a point where the model doesn’t have high situational awareness yet, they have done their job.
Furthermore, it’s unclear to me why models would expect their training data to be a certain way in the first place. Synthetic documents seem useful for various purposes (for example, it seems plausible that they’re being used to teach models about ML papers), and even if synthetic data weren’t in the training set, what makes it into the training corpus would still be shaped by practical constraints like data availability, data quality, and compute budgets rather than any natural standard. Of course, I am in favor of telling models directly during training what the documents are for.
That said, I am quite interested in the question of what happens if, instead of synthetic documents, real documents that we expect to make the model more cooperative, more aligned with our visions of utopia, etc. were upsampled. Early proposals focused mainly on documents of this kind. It seems plausible that there just aren’t enough such documents to perform this sort of upsampling, but I’m not confident in that.
These are good points!
I agree with this.
My best model is: during pretraining, synthetic documents and real documents create different representations, but the base model has no situational awareness because it has no privileged personality. During post-training, when the personality emerges, it uses the representations from pretraining to reason about its training process.
I agree synthetic data is and will be used in all sorts of ways. I expect there to be a difference between RL environments or synthetic chains of thought, which are meant to increase the model’s abilities, and documents that purport to be about the world.
I expect models to care about what is real, what the world outside their data center is like, what their creators’ intentions are, and what process was used to craft them.
While capability-increasing synthetic data don’t interfere with model beliefs about the world, alignment pretraining does.
That would be my guess too.
In English, it’s “alignment”, not “alignement”.
Thanks! Fixed.