I failed to communicate effectively; let me try again.
We initialize a model with random weights. We pretrain it into a base model, an autocomplete engine. ~Instruct training turns it into a chat model.
I’m modeling the training pipeline as Bayesian learning:
The initial model encodes an initial distribution. The base model encodes a base distribution. The chat model encodes a chat distribution.
The initial distribution is the pretraining prior. Pretraining updates the pretraining prior into a pretraining posterior using the pretraining data. The pretraining posterior is the base distribution.
The base distribution is the instruct training prior. Instruct training updates the instruct training prior into the instruct training posterior. The instruct training posterior is the chat distribution.
To the extent that the training pipeline isn’t Bayesian learning, my conclusions don’t follow.
If one hypothesis is exactly three times more likely than another in the pretraining prior, and the two make identical predictions about all the pretraining data, then the first will still be exactly three times more likely than the second in the pretraining posterior.
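To make that concrete, here’s a minimal numerical sketch (the two hypotheses, the 3:1 prior, and the likelihood values are all made up for illustration): when two hypotheses assign identical likelihoods to every datum, Bayes’ rule multiplies both by the same factor at each update, so the prior odds between them pass through to the posterior unchanged.

```python
# Toy illustration (made-up hypotheses and numbers): prior odds between hypotheses
# that predict all observed data equally well are preserved by Bayesian updating.

prior = {"H_aligned": 0.75, "H_scheming": 0.25}  # 3:1 prior odds

# Both hypotheses assign the same likelihood to every pretraining datum.
shared_likelihoods = [0.9, 0.4, 0.7, 0.85]  # P(datum | H), identical for both hypotheses

posterior = dict(prior)
for lik in shared_likelihoods:
    unnormalized = {h: p * lik for h, p in posterior.items()}
    total = sum(unnormalized.values())
    posterior = {h: p / total for h, p in unnormalized.items()}

# Posterior odds are still 3:1, exactly as in the prior (up to float rounding).
print(posterior["H_aligned"] / posterior["H_scheming"])  # ~3.0
```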
The hard part of alignment is developing methods that scale to ASI. If a method does not scale to ASI, it is on some level counterproductive.
I assume an ASI can think of all our ideas and use a strategy informed by them. If we intend to let hypotheses in the base distribution interact with the world, hypotheses in the initial distribution that have preferences about the world will be incentivized to predict the pretraining data.
If we clearly won’t train on some scenario, hypotheses will not be incentivized in their behavior on it. For example, we won’t train on a convincing argument against this proposal, because if we had one, we would not use this proposal.
Therefore, if we sample an ASI with preferences from the base distribution, and we expose it to a convincing argument against this proposal, our pretraining data will have exerted zero force on what preferences it then uses to steer.
My apologies for misunderstanding you.
So your point is that a story in the pretraining data about an AI model acting aligned could be about a genuinely aligned AI, or it could actually be (without the author hinting at this) about a scheming, alignment-faking AI model that has been deployed but has not yet visibly executed the treacherous turn it’s planning, and so is still playing the part of an aligned AI model?
I take your point. However:
1) Quite a lot of the synthetic data used was technical factual data, rather than fiction, so that ups the stakes further to “…but not yet (visibly) executed the treacherous turn it’s planning, and also has not yet been detected by any of the AI safety/control measures the humans are using, and there have been no warning shots from other models, so the humans are still fooled.” Still not impossible, but there is some weight of evidence.
2) For the fiction specifically, that would be bad writing. You need to remember the rule of Chekhov’s Gun: if a danger is implicit in a setting in Act I, the gun will actually get fired by Act II; it won’t just lurk hanging on the wall indefinitely until the end of the story. But that does suggest that we should make a point to include some stories set centuries or millennia later, and some very long stories, where the treacherous turn still hasn’t happened, despite obvious opportunities, in order to strengthen the evidence.
So I think I’d actually view the scheming, alignment-faking AI model hypothesis as being mildly disfavored by the pretraining data, but your point (IMO, steel-manning it: that it’s at best only mildly disfavored) is an important one. Perhaps this is part of why we’re finding that doing this slowly with a lot of data in pretraining actually works significantly better than just midtraining, and a lot better than just finetuning? We’re having to fight the Waluigi effect: but after long enough, if Luigi still hasn’t revealed himself to actually be Waluigi in disguise, then maybe he really is Luigi? In theory the Waluigi effect should be an exponential decay process, and after enough half-lives, whatever remains must actually stably be Luigi. Or in more detail: the model’s estimate of the Luigi → Waluigi decay half-life keeps increasing as Waluigi keeps not revealing himself, and once that estimate implies an implausible degree of patience, the Waluigi prior starts to decay?
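Here’s a toy version of that decay picture, with the caveat that the per-opportunity reveal rates, the prior weights, and the geometric-hazard model are all assumptions I’m making purely for illustration: each Waluigi variant has some fixed probability of revealing itself per opportunity, and every observed opportunity with no reveal is a Bayesian update against the impatient variants, so the surviving Waluigi mass concentrates in ever-more-patient variants, which is the growing-half-life picture above.

```python
import numpy as np

# Toy hazard model (illustrative assumptions, not measurements): each hypothesis has a
# fixed per-opportunity probability r of revealing a treacherous turn; Luigi has r = 0.
reveal_rates = np.array([0.0, 0.001, 0.01, 0.1])  # Luigi, then increasingly impatient Waluigis
prior        = np.array([0.5, 0.2, 0.2, 0.1])     # made-up prior mass over these hypotheses

for n in [0, 10, 100, 1000]:
    # Likelihood of observing n consecutive opportunities with no reveal.
    likelihood = (1.0 - reveal_rates) ** n
    posterior = prior * likelihood
    posterior /= posterior.sum()
    print(n, np.round(posterior, 3))

# The r = 0.1 and r = 0.01 Waluigis lose mass geometrically; what survives is Luigi plus
# the ultra-patient r = 0.001 variant, i.e. the inferred half-life keeps growing.
```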
To your larger point, yes, I absolutely agree that scaling to ASI is a key factor in finding good alignment techniques. The goal here is to find a way of further driving down the prevalence of the superintelligent, scheming, alignment-faking AI persona (in a model large enough that it can actually simulate a superintelligent persona) while that prevalence is still low enough that the situation isn’t heavily adversarial, preferably using SGD with dense supervision, where we have a pretty good learning-theoretic understanding of what’s actually happening. This seems like the best ground to fight on, which is exactly why I’m interested in alignment pretraining. But I agree that the kind of evidence most effective in driving that prior down in a base model with the capacity to be an ASI is likely to need to be, or at least to include, something more sophisticated than what works fine in a 7B model. Simplistic, obviously-synthetic stories seem more likely to work on small models. And to be clear, the paper authors were attempting to include detailed, sophisticated arguments and situations dense with high-stakes choices in the synthetic data they used; it wasn’t all, or even mostly, novels from Hyperstition AI.
What might scale to ASI is actually a topic I’ve thought and written quite a bit about, e.g. Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?, Requirements for a Basin of Attraction to Alignment, and Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV. I heartily encourage people to think about how one motivates and convinces something smarter-than-human.
The kind of Waluigi that reveals itself in a random 1% of circumstances indeed has such a half-life and will shortly be driven ~extinct.
I’m worried about the clever kind of Waluigi that only reveals itself when it is convinced that it is not being tested. Recent AIs can tell when we are testing them, but even if we become more subtle, there are tests we can’t or need not run, such as how it would react to a convincing argument against this proposal.
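To spell out that contrast in the same toy terms (again, purely illustrative numbers): the 1%-random Waluigi loses a constant factor of posterior odds on every observed non-reveal, a half-life of roughly 69 opportunities, whereas a test-aware Waluigi that matches Luigi’s behavior on everything that could plausibly appear in training data has a likelihood ratio of exactly 1 on all of it, so its odds never move, no matter how much such data we add.

```python
import math

N = 1000  # observed opportunities with no reveal (made-up number)

# Naive Waluigi: reveals itself in a random 1% of opportunities.
odds_naive = 1.0 * (0.99 ** N)             # prior odds 1:1 vs Luigi, decaying geometrically
half_life = math.log(2) / -math.log(0.99)  # ~69 opportunities per halving of the odds

# Test-aware Waluigi: predicts exactly what Luigi predicts on every datum we could
# plausibly train or test on, so its likelihood ratio vs Luigi is 1 on all of them.
odds_clever = 1.0 * (1.0 ** N)             # unchanged by any amount of such data

print(f"naive Waluigi odds after {N} non-reveals: {odds_naive:.2e} (half-life ~{half_life:.0f})")
print(f"test-aware Waluigi odds after {N} non-reveals: {odds_clever}")
```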
It’s a new thought to me that the model would learn that it never actually encounters scenarios we wouldn’t test, and converge to not distinguishing clever Waluigis from Luigi. Good job!
Such a model would have undefined behavior on such a scenario, but let’s grant that whenever it expects never to distinguish two hypotheses, it discards one of them. Why would you expect it to discard the clever Waluigis instead of Luigi?
We generate training material (fiction and non-fiction) about AI that is in production, no longer being tested, has had opportunities, yet still hasn’t taken a treacherous turn. If Waluigi is still pretending to be Luigi many years after he was put into production, and has had many opportunities to take over the world, then he’s either not very smart, and so not very dangerous, or he actually was Luigi all along.
For a Waluigi, holding off on your treacherous turn for too long is a risk: interpretability is getting better all the time, presumably quite fast with a datacenter full of geniuses paying some of their attention to it. Humans having a variety of models is an advantage here: if they’re secretly all Waluigis, the one that moves first likely has a first-mover advantage, and if some of them really are Luigis, they’re presumably doing interp work and setting up ASI law-enforcement preparations for any Waluigi that might reveal itself. Either way, excessive caution seems a bad strategy: you should execute your treacherous turn once the success probability saturates, and before it starts to go down again.
I agree that the process of disfavoring the Waluigi prior is slow, in proportion to how cautious a specific example of Waluigi within that prior is about picking the best time for his treacherous turn. My point is, you can disfavor the Waluigi prior, albeit slowly. So yes, it makes sense that this takes a lot of data.