RogerDearnaley comments on Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training

RogerDearnaley 21 Jan 2026 1:12 UTC
3 points
1
I don’t think the paper has anything to do with a randomly initialized transformer. I think it’s about the priors a base model learns from the training data, about 1001 personas from witch to angel to dentist to aligned AI to paperclip-maximizer. What the paper shows is that the ratio of the last two AI-related priors can be adjusted by raising or lowering the amount of data about AI behaving badly, or by raising the amount of data about AI behaving well; but the baseline for the latter is low, so it’s easier to raise that dramatically than it is to filter out a large proportion of the AI-acting-badly stuff. It also shows that fully adjusting those priors takes a while: a quick finetune with a small amount of data at a high learning rate has a more superficial, less thorough effect than using a lot more data during midtraining, and that’s still not as good as using even more data all through pretraining.
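As a toy illustration of the low-baseline point (the document counts below are made up for the arithmetic and are not taken from the paper), the same tenfold shift in the good-to-bad ratio can come from filtering out most of the bad data or from adding a comparatively small amount of good data:

```python
def good_to_bad_ratio(good_docs: int, bad_docs: int) -> float:
    """Ratio of AI-behaving-well documents to AI-behaving-badly documents."""
    return good_docs / bad_docs

# Hypothetical corpus counts, chosen only to make the arithmetic easy.
good, bad = 1_000, 100_000

baseline = good_to_bad_ratio(good, bad)

# Option A: find and filter out 90% of the AI-behaving-badly documents.
filtered = good_to_bad_ratio(good, bad // 10)

# Option B: leave the bad documents alone and add 9,000 AI-behaving-well ones.
augmented = good_to_bad_ratio(good + 9_000, bad)

print(baseline, filtered, augmented)
```

Under these assumed counts, both options move the ratio by the same factor, but option A means identifying and removing 90,000 documents while option B means writing 9,000.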
- David Johnston 21 Jan 2026 9:33 UTC
  7 points
  2
  Parent
  I was responding to Gurkenglas’ comment as I understood it; I agree your paper is not about this.
  - RogerDearnaley 21 Jan 2026 15:08 UTC
    2 points
    0
    Parent
    Thanks, I had misunderstood Gurkenglas. I’m not used to thinking of a randomly initialized model as a bag of priors rather than a random starting point in a very high dimensional space or an inchoate mess, but yes, under the analogy to Bayesian inference it’s actually some sort of statistical approximation to a uniform prior (with, CLT informs us, a simplicity bias that approximates the Solomonoff one).