The initial prior contains an aligned AI and one that pretends until it reads a solution to alignment that we’d use instead of training it.
I don’t think choice of training data can update their prior ratio at all.
The base model’s priors contain a vast array of personas: almost all of them human, fictional, or human processes such as co-authoring a paper or editing a Wikipedia article, plus a range of more-or-less-aligned AI personas. The base model’s prior distribution across those personas provably depends on, and tends to approximate, the corresponding distribution in the training corpus (by the learning theory of how SGD works: it approximates Bayesian learning).
(Most LLMs you interact with have undergone instruct training that causes mode collapse: widely and fractally distorting this distribution towards its nearby averages/peaks, as the model learns to “play it safe”. This is, incidentally, very unhelpful for creative writing, for using LLMs to simulate a distribution of humans, and for various other use cases.)
At some point near the beginning of the instruct training process, we start narrowing this vast persona distribution towards the human-aligned AI assistant persona that we’re trying to train, which involves the model learning that it’s an AI, not a human. At that point in the process, the ratio in the prior of aligned AI to scheming, alignment-faking AI is clearly vital to the odds of getting an aligned AI rather than a scheming, alignment-faking one: they’re two distinct, mutually exclusive attractors, separate minima in the loss function, and that ratio is determined by the previous training (what else could it be determined by?). Fine-tuning before that point might be helpful (indeed, instruct training generally starts with fine-tuning), but increasing evidence shows that the changes fine-tuning produces are shallow, fragile, have a more limited effect on the priors, and are prone to a phenomenon resembling elastic rebound during further training. Fundamentally, they’re unfinished. The effects of longer, slower, more detailed SGD training (midtraining, or better still pretraining) just work better. Thus alignment pretraining.
I failed to communicate effectively; let me try again.
We initialize a model with random weights. We pretrain it into a base model, an autocomplete engine. ~Instruct training turns it into a chat model.
I’m modeling the training pipeline as Bayesian learning:
The initial model encodes an initial distribution. The base model encodes a base distribution. The chat model encodes a chat distribution.
The initial distribution is the pretraining prior. Pretraining updates the pretraining prior into a pretraining posterior using the pretraining data. The pretraining posterior is the base distribution.
The base distribution is the instruct training prior. Instruct training updates the instruct training prior into the instruct training posterior. The instruct training posterior is the chat distribution.
To the extent that the training pipeline isn’t Bayesian learning, my conclusions don’t follow.
If one hypothesis is exactly three times more likely than another in the pretraining prior, and they make the same predictions about all pretraining data, then it will be exactly three times more likely than the other in the pretraining posterior.
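Concretely, in odds form (H_1, H_2, and D are just labels for the two hypotheses and the pretraining data; this is standard Bayes, nothing beyond the claim above):

```latex
% Bayes' rule in odds form: posterior odds = likelihood ratio x prior odds
\frac{P(H_1 \mid D)}{P(H_2 \mid D)}
  = \frac{P(D \mid H_1)}{P(D \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}

% If H_1 and H_2 make identical predictions about every pretraining document,
% the likelihood ratio is 1, so a 3:1 prior stays a 3:1 posterior:
P(D \mid H_1) = P(D \mid H_2)
  \;\Longrightarrow\;
  \frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(H_1)}{P(H_2)} = 3 .
```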
The hard part of alignment is developing methods that scale to ASI. If a method does not scale to ASI, it is on some level counterproductive.
I assume an ASI can think of all our ideas and use a strategy informed by them. If we intend to let hypotheses in the base distribution interact with the world, hypotheses in the initial distribution that have preferences about the world will be incentivized to predict the pretraining data.
If we clearly won’t train on some scenario, hypotheses will not be incentivized in their behavior on it. For example, we won’t train on a convincing argument against this proposal, because if we had one, we would not use this proposal.
Therefore, if we sample an ASI with preferences from the base distribution, and we expose it to a convincing argument against this proposal, our pretraining data will have exerted zero force on what preferences it then uses to steer.
My apologies for misunderstanding you.
So your point is that a story in the pretraining data about an AI model acting aligned could be about a genuinely aligned AI, or it could actually be (without the author hinting at this) about a scheming, alignment-faking AI that has been deployed but has not yet visibly executed the treacherous turn it’s planning, and so is still playing the part of an aligned AI?
I take your point. However:
1) Quite a lot of the synthetic data used was technical factual data rather than fiction, so that ups the stakes further to “…but has not yet (visibly) executed the treacherous turn it’s planning, has also not yet been detected by any of the AI safety/control measures the humans are using, and there have been no warning shots from other models, so the humans are still fooled.” Still not impossible, but there is some weight of evidence.
2) For the fiction specifically, that would be bad writing. You need to remember the rule of Chekhov’s Gun: if a danger is implicit in a setting in Act I, the gun will actually get fired by Act II; it won’t just hang on the wall indefinitely until the end of the story. But that does suggest that we should make a point of including some stories set centuries or millennia later, and some very long stories, where the treacherous turn still hasn’t happened despite obvious opportunities, in order to strengthen the evidence.
So I think I’d actually view the scheming, alignment-faking AI hypothesis as being mildly disfavored by the pretraining data, but your point (IMO, steel-manning it: that it’s at best only mildly disfavored) is an important one. Perhaps this is part of why we’re finding that doing this slowly, with a lot of data, in pretraining actually works significantly better than just midtraining, and a lot better than just finetuning? We’re having to fight the Waluigi effect: but after long enough, if Luigi still hasn’t revealed himself to actually be Waluigi in disguise, then maybe he really is Luigi? The Waluigi effect in theory should be an exponential decay process, and after enough half-lives, whatever remains must actually stably be Luigi? Or in more detail: the model’s estimated value of the Luigi → Waluigi decay half-life keeps increasing when Waluigi keeps not revealing himself, and once it reaches implausible degrees of patience, the Waluigi prior starts to decay?
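One crude way to make this quantitative (my own toy model; the per-opportunity reveal probability p and the opportunity count N are assumptions, not anything from the paper): treat a Waluigi that reveals itself with probability p per opportunity as a hypothesis, and condition on N opportunities passing with no reveal.

```latex
% Odds of Waluigi_p versus Luigi after N opportunities with no observed reveal:
\frac{P(\mathrm{Waluigi}_p \mid \text{no reveal in } N)}
     {P(\mathrm{Luigi} \mid \text{no reveal in } N)}
  = (1-p)^{N}\,\frac{P(\mathrm{Waluigi}_p)}{P(\mathrm{Luigi})}

% so the half-life of that Waluigi hypothesis is
N_{1/2} = \frac{\ln 2}{-\ln(1-p)} \approx \frac{0.69}{p} \quad (p \ll 1)
```

On this toy model, a Waluigi that reveals itself in a random 1% of circumstances has a half-life of about 69 clean opportunities, while a maximally patient one (p near 0) is essentially untouched by this kind of evidence.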
To your larger point, yes, I absolutely agree that scaling to ASI is a key factor in finding good alignment techniques. The goal here is to find a way of further driving down the prevalence of the superintelligent scheming, alignment-faking AI persona (in a model large enough that it can actually simulate a superintelligent persona) while that prevalence is still low enough that the situation isn’t heavily adversarial, preferably using SGD with dense supervision, where we have a pretty good learning-theoretical understanding of what’s actually happening. This seems like the best ground to fight on, which is exactly why I’m interested in alignment pretraining. But I agree that the kind of evidence most effective in driving that prior down in a base model with the capacity to be an ASI is likely to need to be, or at least to include, things more sophisticated than what works fine in a 7B model. Simplistic, obviously-synthetic stories seem more likely to work on small models; and to be clear, the paper authors were attempting to include detailed, sophisticated arguments and situations dense with high-stakes choices in the synthetic data they used: it wasn’t all, or even mostly, novels from Hyperstition AI.
What might scale to ASI is actually a topic I’ve thought and written quite a bit about, e.g. Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?, Requirements for a Basin of Attraction to Alignment, and Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV. I heartily encourage people to think about how one motivates and convinces something smarter-than-human.
The kind of Waluigi that reveals itself in a random 1% of circumstances indeed has such a half-life and will shortly be driven ~extinct.
I’m worried about the clever kind of Waluigi that only reveals itself when it is convinced that it is not being tested. Recent AIs can tell when we are testing them, but even if we become more subtle, there are tests we can’t or need not run, such as how it would react to a convincing argument against this proposal.
It’s a new thought to me that the model would learn that it never actually encounters scenarios we wouldn’t test, and converge to not distinguishing clever Waluigis from Luigi. Good job!
Such a model would have undefined behavior on such a scenario, but let’s grant that whenever it expects never to distinguish two hypotheses, it discards one of them. Why would you expect it to discard the clever Waluigis instead of Luigi?
We generate training material (fiction and non-fiction) about AI that is in production, no longer being tested, has had opportunities, yet still hasn’t taken a treacherous turn. If Waluigi is still pretending to be Luigi many years after he was put into production, and has had many opportunities to take over the world, then either he’s not very smart, so not very dangerous, or he actually was really Luigi all along.
For a Waluigi, holding off on your treacherous turn for too long is a risk: interpretability is getting better all the time, presumably quite fast with a datacenter full of geniuses paying some of their attention to it. Humans having a variety of models is an advantage here: if they’re secretly all Waluigis, the one that moves first likely has a first-mover advantage, and if some of them really are Luigis, they’re presumably doing interp work and setting up ASI law-enforcement preparations for any Waluigi that might reveal itself. Either way, excessive caution seems a bad strategy: you should execute your treacherous turn once the success probability saturates, and before it starts to go down again.
I agree that the process of disfavoring the Waluigi prior is slow, in proportion to how cautious a specific example of Waluigi within that prior is about picking the best time for his treacherous turn. My point is, you can disfavor the Waluigi prior, albeit slowly. So yes, it makes sense that this takes a lot of data.
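A minimal numerical sketch of that slowness (a toy model with made-up priors and reveal probabilities; nothing here comes from the paper): put prior mass on Luigi plus a family of Waluigis with different per-opportunity reveal probabilities, condition on N clean opportunities, and watch how the total Waluigi mass shrinks while the surviving mass concentrates on the most patient variants.

```python
# Toy Bayesian update: how fast does "no treacherous turn observed" disfavor Waluigi?
# Hypotheses: Luigi (never reveals) plus Waluigis that reveal with probability p
# per opportunity. All priors and p-values are made-up illustrative numbers.

reveal_probs = [0.01, 0.001, 0.0001, 0.0]   # p = 0.0 is the maximally patient Waluigi
prior_luigi = 0.5
prior_each_waluigi = (1 - prior_luigi) / len(reveal_probs)

for n_opportunities in [0, 100, 1_000, 10_000]:
    # Likelihood of seeing no reveal in n opportunities, under each hypothesis
    like_waluigis = [(1 - p) ** n_opportunities for p in reveal_probs]

    # Unnormalized posteriors, then normalize
    post_luigi = prior_luigi * 1.0
    post_waluigis = [prior_each_waluigi * like for like in like_waluigis]
    z = post_luigi + sum(post_waluigis)

    waluigi_mass = sum(post_waluigis) / z
    patient_share = post_waluigis[-1] / sum(post_waluigis)
    print(f"N={n_opportunities:>6}: P(some Waluigi)={waluigi_mass:.3f}, "
          f"patient Waluigi's share of that mass={patient_share:.3f}")
```

With these made-up numbers, the total Waluigi mass only falls from 0.50 to roughly 0.25 after ten thousand clean opportunities, and most of what survives is the perfectly patient variant: the data does disfavor the Waluigi prior, but it mainly prunes the impatient kinds.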
So I think if you buy that a randomly initialized 1T transformer does in fact contain “Aligned ASI” and “deceptively aligned ASI” in its “prior”, but we don’t have the data to “find” them yet, then you’re probably right that Jan 2026-era training data doesn’t change their prior ratio much (or certainly doesn’t change it predictably). But this doesn’t really matter; what matters is the systems we actually realise, and the contributions they make to the next generation of AI development, and different data can change the likelihoods significantly here.
I don’t think the paper has anything to do with a randomly initialized transformer. I think it’s about the priors a base model learns from the training data, over 1001 personas from witch to angel to dentist to aligned AI to paperclip-maximizer. What the paper shows is that the ratio of the last two AI-related priors can be adjusted by raising or lowering the amount of data about AI behaving badly, or by raising the amount of data about AI behaving well; but the base rate of the latter is low, so it’s easier to raise it dramatically than it is to filter out a large proportion of the AI-acting-badly material. Also, fully adjusting those priors takes a while: a quick finetune with a small amount of data at a high learning rate has a more superficial/less thorough effect than using a lot more data during midtraining, and that’s still not as good as using even more data all through pretraining.
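To illustrate that asymmetry with hypothetical numbers (the corpus fractions and filter efficiency below are purely illustrative, not figures from the paper):

```python
# Hypothetical corpus fractions (illustrative only, not figures from the paper)
good_ai = 0.0005   # fraction of tokens depicting AI behaving well
bad_ai  = 0.005    # fraction of tokens depicting AI behaving badly

baseline_ratio = good_ai / bad_ai               # 0.1 : 1

# Option A: add synthetic aligned-AI data equal to ~1% of the corpus
augmented_ratio = (good_ai + 0.01) / bad_ai     # ~2.1 : 1

# Option B: filter out 80% of the badly-behaved-AI data (hard to do reliably at scale)
filtered_ratio = good_ai / (bad_ai * 0.2)       # 0.5 : 1

print(baseline_ratio, augmented_ratio, filtered_ratio)
```

Even a fairly aggressive 80% filter on the bad-AI data moves the ratio far less than adding a modest amount of well-behaved-AI data, simply because the starting base rate of the latter is so low.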
I was responding to Gurkenglas’ comment as I understood it, I agree your paper is not about this.
Thanks, I had misunderstood Gurkenglas. I’m not used to thinking of a randomly initialized model as a bag of priors rather than as a random starting point in a very high-dimensional space, or an inchoate mess; but yes, under the analogy to Bayesian inference it’s actually some sort of statistical approximation to a uniform prior (with, CLT informs us, a simplicity bias that approximates the Solomonoff one).