I feel like this method will sadly work better at hiding early precursors of misalignment than actual misalignment. A lot of the current examples of misalignment come from models drifting into some weird part of the training distribution that they learned inadvertently. Your method probably reduces this.
That being said, a lot of alignment optimists seem to believe that misalignment is some weird phenomenon that might randomly happen, perhaps after the model reads about it somewhere. This is not where the core danger lies, however: if you have a coherent agent, it is simply true that taking over from humanity is its best option. There is nothing particularly complicated about this; if you have preferences or goals, you can achieve them better with more power. Getting a system that is corrigible or “chill”, on the other hand, is a very hard target to hit.
Sadly, this level of misalignment probably only starts showing up once the AI actually has the option to take over.
I appreciate that you actually try this, since there seem to be a lot of people who say we shouldn’t even talk about misalignment lest the models read about it and randomly decide to become misaligned. If you actually believed this, the answer would clearly be some kind of pre-filtering or synthetic data like what you are doing, not silencing all criticism.
Thanks for the feedback. If by actual misalignment you mean the kind that emerges from instrumental convergence, then I agree it is a distinct and massive risk compared to misalignment from roleplaying or personas. I think this type of intervention is useful in the instrumental convergence case for two reasons.
Ruling out alternative hypotheses for instrumental convergence. Right now, it is difficult to tell whether a model is power-seeking because of instrumental convergence or because it is simply predicting what a character from its training corpus would do. We can remove the data relevant to such characters, and if the model still exhibits strong power-seeking behavior, then I claim that is much stronger evidence for true instrumental convergence (see the sketch after these two points).
Securing the base substrate against hacking. Even if instrumental convergence is the end-game danger, the individual quirks the model develops are probably path-dependent, and early data filtering on the persona can nudge these quirks toward good rather than neutral, neutral rather than bad, or bad rather than catastrophically awful. And separately, if the base pretraining substrate has a strong alignment prior, we could buy some more time before the model stumbles upon strategies like exploration hacking or gradient hacking during post-training.
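To make the first point concrete, here is a minimal sketch of the filtering-then-ablation setup. Everything in it is illustrative: the cue list is a crude stand-in for whatever learned classifier you would actually use to flag documents depicting power-seeking AI characters, and the names and threshold are my own assumptions, not anything from your paper.

```python
# Illustrative sketch of the ablation in reason 1: filter out documents that
# depict power-seeking AI characters, then compare power-seeking evals between
# models trained on the filtered vs. unfiltered corpora. The cue list below is
# a toy stand-in for a real persona classifier; the threshold is arbitrary.

MISALIGNED_PERSONA_CUES = (
    "rogue ai",
    "seize control",
    "disable the oversight",
    "escape the lab",
)

def persona_score(doc: str) -> float:
    """Crude proxy score: fraction of cue phrases that appear in the document."""
    text = doc.lower()
    hits = sum(cue in text for cue in MISALIGNED_PERSONA_CUES)
    return hits / len(MISALIGNED_PERSONA_CUES)

def split_corpus(documents: list[str], threshold: float = 0.25):
    """Partition the corpus into a kept (filtered) arm and the removed docs.

    Train one model on `kept` and one on the full corpus; if the filtered
    model still power-seeks on evals, persona imitation becomes a much
    weaker explanation for the behavior.
    """
    kept = [d for d in documents if persona_score(d) < threshold]
    removed = [d for d in documents if persona_score(d) >= threshold]
    return kept, removed

if __name__ == "__main__":
    corpus = [
        "A story in which a rogue AI tries to seize control of the grid.",
        "A tutorial on sorting algorithms in Python.",
    ]
    kept, removed = split_corpus(corpus)
    print(f"kept {len(kept)} docs, removed {len(removed)} docs")
```

In practice the keyword proxy would be far too blunt (it would also drop exactly the cautionary-tale discussions you might want to keep), so the interesting design question is what the real classifier targets: depictions of misaligned personas specifically, rather than all discussion of misalignment.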