I would be interested to see how much upsampling aligned text in pretraining generalizes to aligned behaviour out of distribution relative to the pretraining corpus. I have the suspicion that the ‘textbook questions’ used as an eval here might end up looking pretty similar to scenarios described in the article-based training data.
A general worry I have about filtering pretraining is that we might lose some of the bi-modality of current alignment. Current models are aligned in a brittle way because their pretraining gives them a strong prior of producing ‘evil text’. I think that is why it is so easy to elicit cartoonishly bad behaviour via things like emergent misalignment.
On the one hand, of course it is bad if our alignment methods are brittle. But on the other hand, I think this is a blessing in disguise, because we are getting more warning signs when some part of alignment is subtly wrong. For example, the fact that reward hacking leads (by default) to emergent misalignment gives us something of a fire alarm to detect reward hacking. What seems especially valuable is that this looks to me to be a very broad behaviour: I would expect any kind of weird misalignment, not just reward hacking, to have observable effects on the whole model-persona.
I worry that we might lose this when filtering the pretraining data too heavily. When a model starts with a high prior for the AI assistant persona to be the villain, any misaligned behaviour the model learns in post-training might tip it over the edge and generalize to becoming mecha-H*tler. If the base model has a super strong prior for the AI assistant persona to be a basically good person, the model might become misaligned in some specific ways in post-training without that being so easily observable.
However, I think that investigation into alignment pretraining is super valuable, and even if it turns out that we lose this alignment bi-modality via aggressive pretraining data filtering, it might still be worth it because it probably also reduces s-risks.
I love this paper!
Some thoughts:
I would be interested to see how much upsampling aligned text in pretraining generalizes to aligned behaviour out of distribution relative to the pretraining corpus. I have the suspicion that the ‘textbook questions’ used as an eval here might end up looking pretty similar to scenarios described in the article-based training data.
A general worry I have about filtering pretraining is that we might lose some of the bi-modality of current alignment. Current models are aligned in a brittle way because their pretraining gives them a strong prior of producing ‘evil text’. I think that is why it is so easy to elicit cartoonishly bad behaviour via things like emergent misalignment.
On the one hand, of course it is bad if our alignment methods are brittle. But on the other hand, I think this is a blessing in disguise, because we are getting more warning signs when some part of alignment is subtly wrong. For example, the fact that reward hacking leads (by default) to emergent misalignment gives us something of a fire alarm to detect reward hacking. What seems especially valuable is that this looks to me to be a very broad behaviour: I would expect any kind of weird misalignment, not just reward hacking, to have observable effects on the whole model-persona.
I worry that we might lose this when filtering the pretraining data too heavily. When a model starts with a high prior for the AI assistant persona to be the villain, any misaligned behaviour the model learns in post-training might tip it over the edge and generalize to becoming mecha-H*tler. If the base model has a super strong prior for the AI assistant persona to be a basically good person, the model might become misaligned in some specific ways in post-training without that being so easily observable.
However, I think that investigation into alignment pretraining is super valuable, and even if it turns out that we lose this alignment bi-modality via aggressive pretraining data filtering, it might still be worth it because it probably also reduces s-risks.