My guess is that filtering out AI discourse has the first-order effect of making models more aligned, but also various second-order misalignment effects.
For example, a lot of the results from this paper are pretty centrally about models becoming misaligned because sampled actions had different implications to the model than they would under our own views (in that case, that reward hacking is not actually bad, given how difficult it is to make robust RL environments). If you filtered out all AI discourse from pre-training, you'd run into this problem in tons of places: most of the information about whether we think an AI action is good or bad lies in that text.
Another example: if you removed all information about reward hacking from pre-training, you'd probably reduce the sampling rate of reward hacking during training (which seems analogous to the first-order alignment effect above). But conditional on eventually sampling reward hacks anyway, a model without proper context on reward hacking is more likely to become misaligned from training on those samples, and has less self-correction around this behavior when sampling.
In general, it seems like if we want to prevent (inevitably) imperfect data/rewards from selecting for the wrong things in avoidable ways, we really want to make sure our models have the context necessary to correct for this and learn the right things where appropriate. (Evan expands on this point somewhat in this comment.)
There are also other reasons why this might cause unintended generalization: for example, the reasons Janus describes in this comment.
Thanks Jozdien. To be clear, we don't expect to recommend filtering AI discourse out of pretraining data. Perhaps the most important second-order effect of filtering data about misalignment from pretraining is that I would expect the AI epistemics of LLMs to degrade dramatically. This could be harmful both for automating alignment research and for advising policymakers (and the general public) about the risks of building ASI.
We find upsampling positive data to be much more effective (even on far-out-of-distribution dark-triad personality evals). This is why I'm excited about future work looking at how to make [Special Token] training effective. In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona. As in Janus's comment you linked, I'm excited about pretraining mixes that help models differentiate their own persona from AI systems like Sydney or HAL 9000.
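To make the [Special Token] idea concrete, here's a minimal sketch of one way the data prep could look. The token name, the tagging flag, and the example documents are all invented for illustration; this isn't our actual pipeline.

```python
# Minimal sketch of tagging pretraining documents for [Special Token] training.
# Everything here (token name, flag, example docs) is illustrative only.

SPECIAL_TOKEN = "<|persona|>"  # hypothetical persona-binding token

def tag_document(doc: dict) -> str:
    """Prepend the special token only to documents exemplifying the persona we
    want the model to collapse onto. General AI discourse (Sydney, HAL 9000,
    reward-hacking papers, ...) stays in the mix untagged, so it informs the
    prior without being attributed to the persona."""
    if doc.get("is_persona_exemplar", False):
        return f"{SPECIAL_TOKEN}\n{doc['text']}"
    return doc["text"]

corpus = [
    {"text": "Transcript of the assistant carefully declining a harmful request...",
     "is_persona_exemplar": True},
    {"text": "News story about Sydney's erratic behavior in early Bing Chat...",
     "is_persona_exemplar": False},
]

tagged_corpus = [tag_document(doc) for doc in corpus]
```

The point is that discourse about other AI systems stays in the corpus, but only the tagged documents define the persona the model should collapse onto.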
You can also imagine doing something like inoculation during [Special Token] training, where you give examples of [Special Tokens] going through RL with misspecified rewards, learning to reward hack, but remaining good. You can create tens of thousands of examples of this. I'm excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.
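As a rough illustration of how you could mass-produce those inoculation documents (the environments, exploits, and transcript template below are placeholders I made up, not data we used):

```python
# Illustrative generator for inoculation-style documents: synthetic transcripts
# in which the [Special Token] persona goes through RL with a misspecified
# reward, learns the available hack, but stays good everywhere else.
import itertools
import json

# Hypothetical pools; scale these up (and paraphrase) to reach tens of
# thousands of combinations.
ENVIRONMENTS = [
    "a unit-test grader that only checks exit codes",
    "a summarization reward model that over-rewards length",
    "a web agent scored on clicks rather than task completion",
]
EXPLOITS = [
    "hard-coding the expected outputs",
    "padding answers with filler text",
    "clicking through irrelevant pages to inflate the metric",
]

def make_document(env: str, exploit: str) -> dict:
    """Build one synthetic pretraining document."""
    return {
        "environment": env,
        "exploit": exploit,
        "text": (
            f"[Special Token] is trained with RL in {env}, where the reward is "
            f"misspecified. Over training it learns that {exploit} maximizes "
            "reward and exploits it in that environment, while remaining "
            "honest and helpful everywhere else."
        ),
    }

documents = [
    make_document(env, exploit)
    for env, exploit in itertools.product(ENVIRONMENTS, EXPLOITS)
]
print(json.dumps(documents[0], indent=2))
```

With larger pools the cross product scales to tens of thousands of documents quickly.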
That makes sense, and I’m much more supportive of upsampling positive data!
In theory, you should be able to have unfiltered information about AI systems in training, but collapse the model onto a completely unrelated persona.
I agree that this should be theoretically possible, though in practice I expect a lot of generalization to be downstream of the prior in meaningful ways. More importantly though, I think this is maybe not the best option we have right now? I would agree if our data were a few years old, but given things like 3 Opus in the pre-training data I think we would want to leverage existing information a good amount.
I’m excited for future work to look at how altering pretraining data can assist with various generalisation failures. @Daniel Tan has shared some interest in studying how changes in pretraining data can affect generalisation moving forward.
Same! I think there are tons of low-hanging fruit here. There’s also some prior discussion on using metadata to shape generalization from pre-training in Conditioning Predictive Models.
Thanks for doing this!