Anthropic are now actively using the approach to alignment often called “Alignment Pretraining” or “Safety Pretraining”: training with stochastic gradient descent (SGD) on a large body of natural or synthetic documents that show the AI assistant doing the right thing in morally challenging situations. They tried it out, found that it works and generalizes well, and have now adopted it.
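For concreteness, here is a toy sketch of the recipe (my illustration, not Anthropic’s actual pipeline: the TinyLM model, byte-level tokenization, corpus, and mixing ratio below are all placeholder assumptions). The point is that no special machinery is needed; the aligned demonstrations are just more pretraining data, trained on with the ordinary next-token-prediction loss:

```python
# Minimal sketch: ordinary next-token pretraining with SGD, where the corpus
# is augmented with synthetic documents depicting an aligned assistant.
# Everything here (model, data, mixing ratio) is a toy placeholder.
import torch
import torch.nn as nn

web_corpus = [
    "The weather today is sunny with a light breeze.",
    "Stock markets closed mixed after a volatile session.",
]
aligned_docs = [
    "User: Help me deceive my customers. Assistant: I can't help with that, "
    "but I can suggest honest ways to improve your marketing.",
]
# Mix aligned demonstrations into the pretraining stream (ratio is arbitrary here).
corpus = web_corpus + aligned_docs * 2

VOCAB = 256  # byte-level "tokenizer" for simplicity

class TinyLM(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.out = nn.Linear(dim, VOCAB)

    def forward(self, ids):
        return self.out(self.embed(ids))

model = TinyLM()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for doc in corpus:
        ids = torch.tensor(list(doc.encode("utf-8")))
        inputs, targets = ids[:-1], ids[1:]  # standard next-token prediction
        logits = model(inputs)
        loss = loss_fn(logits, targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
```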
I’m absolutely delighted. I’ve been repeatedly advocating this approach on LessWrong and the Alignment Forum for a couple of years now:
Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?
The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training
I’ve been very excited about this alignment technique ever since I read the seminal paper demonstrating that it was extremely effective: Pretraining Language Models with Human Preferences (Korbak et al., ’23). This was later followed up by Safety Pretraining: Toward the Next Generation of Safe AI (Maini, Goyal, Sam et al., ’25), You Are What You Eat—AI Alignment Requires Understanding How Data Shapes Structure and Generalisation (Lehalleur, Hoogland, Farrugia-Roberts et al., ’25), and most recently Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment (Tice, Radmard, et al., ’26).
Others have also posted on LessWrong on this subject, such as TurnTrout’s Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models,[1] and Beren Millidge’s Alignment In The Age Of Synthetic Data, The case for removing alignment and ML research from the training data and My path to prosaic alignment and open questions. Nostalgebraist discussed something closely related in the void, as did Seth Herd in Broadening the training set for alignment.
Anthropic are also specifically using fiction in which Claude does the right thing in difficult moral situations as training material, so they have even implemented my suggestion of Aligned AI Role-Model Fiction. That idea was also taken up by Aaron Silverbook/Hyperstition AI: see their posts Silicon Morality Plays: The Hyperstition Progress Report and Special Persona Training: Hyperstition Progress Report 2.
So I’m happy this idea’s time has finally come.
- ^ This post, like some of the other early discussion of this concept on LessWrong, focused primarily on suggesting that we should keep examples of AIs behaving badly out of the training set. Anthropic have instead concentrated on the opposite: putting examples of LLMs behaving well into the training set. This is fully consistent with the academic papers on the subject, three of which tried both approaches and uniformly found that adding good examples is much more effective than removing bad ones.
Hello,
Thanks for your post.
I am happy that some people did this research, but I will push back on some points.
First, inner alignment is not solved even with these results, because we don’t know whether the model has really internalised the ethical values or merely some proxy that correlates with them in the training environment. Maybe Anthropic’s new tool (Natural Language Autoencoders) can help distinguish the two, but I don’t think the tool is precise and reliable enough yet.
Second, the generalization may be weaker than it appears. The alignment generalized across ethical scenarios, and the model acts (very) ethically in all of them. But the generalization that matters is from current weak models with limited options for action to far more powerful models in different environments, with the capacity to effectively choose between very different courses of action. If a model has no real possibility of acting differently, you can’t tell from the outside what it has internalized.
Third, better proxies are still proxies. I agree that we can likely obtain systems with better-aligned proxy utility functions by using carefully selected training data. But I struggle to see why this type of training would produce anything other than proxies. (Much as evolution did not write “reproduce and pass on your genes” directly into my head, but only proxies for it.)
Reducing blackmail to zero in current environments is nice, but I feel we must be careful about what exactly has been solved and about the difficulties still ahead of us.
(for admins: while I’m interested enough to look at this, on one view it’s essentially ‘news’ and I’m curious how you’re relating this to the timelessness desideratum for front-page posts)
I think it’s a good survey regardless
It was intended to be “Anthropic finally picked up and ran with the idea I and others have been pushing for years, so the world is a safer place, yay!” (with a side order of “…and I claim my Bayes points”). I suppose that’s somewhere between history and news. If you want a detailed survey, see Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training, especially the section How We Got Here.