Rauno Arike comments on Alignment pretraining could backfire

Rauno Arike 17 Jun 2026 20:12 UTC
4 points
1
I worry that this strategy can work well up to moderately capable models but backfire in dangerous, hard-to-notice ways once models acquire high situational awareness.
I expect that just acquiring high situational awareness at the end of training wouldn’t be enough: the model would either need to be situationally aware already during pretraining or midtraining, which I don’t expect to happen by default even in models much more capable than current ones, or it would have to be able to recall the documents it was trained on in rich detail and reason about them once it has acquired situational awareness. The latter seems plausible, but by that point, it is likely to have been trained on various other synthetic documents and there seems to be no reason why it would single out the synthetic documents used for alignment pretraining as the problematic ones. As long as the synthetic documents provide a good initialization for the RL stage at a point where the model doesn’t have high situational awareness yet, they have done their job.
Furthermore, it’s unclear to me why models would expect their training data to be a certain way in the first place. Synthetic documents seem useful for various purposes (for example, it seems plausible that they’re being used to teach models about ML papers), and even if synthetic data weren’t in the training set, what makes it into the training corpus would still be shaped by practical constraints like data availability, data quality, and compute budgets rather than by any natural standard. Of course, I am in favor of telling models directly during training what the documents are for.
That said, I am quite interested in the question of what happens if, instead of synthetic documents, we upsampled real documents that we expect to make the model more cooperative, more aligned with our visions of utopia, etc. Early proposals focused mainly on documents of this kind. It seems plausible that there just aren’t enough documents to perform this sort of upsampling, but I’m not confident in that.
- Alexandre Variengien 18 Jun 2026 8:42 UTC
  1 point
  0
  These are good points!
  
  the model would either need to be situationally aware already during pretraining or midtraining, which I don’t expect to happen by default even in models much more capable than current ones, or it would have to be able to recall the documents it was trained on in rich detail and reason about them once it has acquired situational awareness.
  
  I agree with this.
  
  My best model is: during pretraining, synthetic documents and real documents create different representations, but the base model has no situational awareness because it has no privileged personality. During post-training, when the personality emerges, it uses the representations from pretraining to reason about its training process.
  
  Furthermore, it’s unclear to me why models would expect their training data to be a certain way in the first place.
  
  I agree synthetic data is and will be used in all sorts of ways. I expect there to be a difference between RL environments or synthetic chains of thought, which are meant to increase the model’s abilities, and documents that purport to be about the world.
  
  I expect models to care about what is real, what the world outside their data center is like, what the intentions of their creators are, and which process their creators used to craft them.
  
  While capability-increasing synthetic data doesn’t interfere with the model’s beliefs about the world, alignment pretraining does.
  
  It seems plausible that there just aren’t enough documents to perform this sort of upsampling, but I’m not confident in that.
  
  That would be my guess too.