Cam comments on Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cam 21 Dec 2025 18:20 UTC
15 points
1
Thanks Jozdien. To be clear, we don’t expect to recommend filtering pretraining data of AI discourse. Perhaps the most important second order affect of filtering data about misalignment from pretraining is that I would expect the AI epistemics of LLMs to decrease dramatically. This could be harmful for both automating alignment research, and also advising policy makers (and the general public) about the risks of building ASI.

We find upsampling positive data to be much for effective (even for far out of distribution dark triad personality evals). This is why I’m excited about future work looking at how to make [Special Token] training effective. In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona. Like Janus’s comment you tagged, I’m excited about pretraining mixes that act as help differentiate their persona AI systems like Sydney or Hal 9000.

You can also imagine dong something like inoculation during [Special Token] training where you give examples of [Special Tokens] going through RL with misspecified rewards, learning to reward hack, but remaining good. You can create tens of thousands of examples of this. I’m excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.
- Jozdien 22 Dec 2025 12:28 UTC
  4 points
  0
  Parent
  That makes sense, and I’m much more supportive of upsampling positive data!
  In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona.
  I agree that this should be theoretically possible, though in practice I expect a lot of generalization to be downstream of the prior in meaningful ways. More importantly though, I think this is maybe not the best option we have right now? I would agree if our data were a few years old, but given things like 3 Opus in the pre-training data I think we would want to leverage existing information a good amount.
  I’m excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.
  Same! I think there are tons of low-hanging fruit here. There’s also some prior discussion on using metadata to shape generalization from pre-training in Conditioning Predictive Models.