My apologies for misunderstanding you.
So your point is that a story in the pretraining data about an AI model acting aligned could be about a genuinely aligned AI, or it could actually be (without the author hinting this) about a scheming, alignment-faking AI model that has been deployed but has not yet visibly executed the treacherous turn it’s planning, so is still playing the part of an aligned AI model?
I take your point. However:
1) Quite a lot of the synthetic data used was technical factual data, rather than fiction, so that ups the stakes further to “…but not yet (visibly) executed the treacherous turn it’s planning, and also has not yet been detected by any of the AI safety/control measures the humans are using, and there have been no warning shots from other models, so the humans are still fooled.” Still not impossible, but there is some weight of evidence against it.
2) For the fiction specifically, that would be bad writing. You need to remember the rule of Chekhov’s Gun: if a danger is implicit in a setting in Act I, the gun will actually get fired by Act II; it won’t just lurk hanging on the wall indefinitely until the end of the story. But that does suggest that we should make a point of including some stories set centuries or millennia later, and some very long stories, where the treacherous turn still hasn’t happened, despite obvious opportunities, in order to strengthen the evidence.
So I think I’d actually view the scheming, alignment-faking AI model hypothesis as being mildly disfavored by the pretraining data, but your point (IMO steel-manning it to say that it’s at best only mildly disfavored) is an important one. Perhaps this is part of why we’re finding that doing this slowly with a lot of data in pretraining actually works significantly better than just midtraining, and a lot better than just finetuning? We’re having to fight the Waluigi effect: but after long enough, if Luigi still hasn’t revealed himself to actually be Waluigi in disguise, then maybe he really is Luigi? The Waluigi effect in theory should be an exponential decay process, and after enough half-lives, whatever remains must actually stably be Luigi? Or in more detail, the model’s estimate of the Luigi → Waluigi decay half-life keeps increasing as Waluigi keeps not revealing himself, and once it reaches implausible degrees of patience, the Waluigi prior starts to decay?
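To make the half-life argument concrete, here is a minimal Python sketch of the Bayesian update I have in mind. Everything in it is an assumption for illustration: the function names, the three hypothetical Waluigi half-lives, and the prior masses are made up, and a Waluigi with half-life h is assumed to survive t steps unrevealed with probability 0.5 ** (t / h), while Luigi never reveals anything.

```python
def surviving_waluigi_mass(waluigi_priors, steps):
    """Prior mass of each Waluigi variant that is still unrevealed after `steps`.

    `waluigi_priors` maps a hypothetical half-life (in steps) to its prior mass.
    A variant with half-life h stays hidden for t steps with probability 0.5 ** (t / h).
    """
    return {h: w * 0.5 ** (steps / h) for h, w in waluigi_priors.items()}


def posterior_luigi(prior_luigi, waluigi_priors, steps):
    """Posterior P(Luigi) given no treacherous turn in `steps` steps."""
    survivors = surviving_waluigi_mass(waluigi_priors, steps)
    return prior_luigi / (prior_luigi + sum(survivors.values()))


def expected_half_life(waluigi_priors, steps):
    """Mean half-life among the Waluigi variants that remain plausible."""
    survivors = surviving_waluigi_mass(waluigi_priors, steps)
    total = sum(survivors.values())
    return sum(h * m for h, m in survivors.items()) / total


# Illustrative numbers only (assumptions, not taken from any experiment).
prior_luigi = 0.5
waluigi_priors = {10: 0.3, 100: 0.15, 1000: 0.05}

for steps in (0, 10, 100, 1000, 10000):
    print(
        steps,
        round(posterior_luigi(prior_luigi, waluigi_priors, steps), 4),
        round(expected_half_life(waluigi_priors, steps), 1),
    )
```

In this toy setup the expected half-life of the surviving Waluigis drifts upward toward the most patient variant, and the total Waluigi mass shrinks only as fast as that most patient variant allows, which is one way of seeing why a lot of data would be needed.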
To your larger point, yes, I absolutely agree that scaling to ASI is a key factor in finding good alignment techniques. The goal here is to find a way of further driving down the prevalence of the superintelligent scheming, alignment-faking AI persona (in a model large enough that it can actually simulate a superintelligent persona) while that prevalence is still low enough that the situation isn’t heavily adversarial, preferably using SGD with dense supervision, where we have a pretty good learning-theoretical understanding of what’s actually happening. This seems like the best ground to fight on, which is exactly why I’m interested in alignment pretraining. But I agree that the kind of evidence most effective in driving that prior down in a base model with the capacity to be an ASI is likely to need to be, or at least need to include, things more sophisticated than things that work fine in a 7B model. Simplistic obviously-synthetic stories seem more likely to work on small models. To be clear, the paper authors were attempting to include detailed sophisticated arguments and situations dense with high-stakes choices in the synthetic data they used; it wasn’t all or even mostly novels from Hyperstition AI.
What might scale to ASI is actually a topic I’ve thought and written quite a bit about, e.g. “Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?”, “Requirements for a Basin of Attraction to Alignment”, and “Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV”. I heartily encourage people to think about how one motivates and convinces something smarter-than-human.
The kind of Waluigi that reveals itself in a random 1% of circumstances indeed has such a half-life and will shortly be driven ~extinct.
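As a quick sanity check on the 1% figure, here is a small Python sketch. It assumes each circumstance is an independent draw with a fixed 1% reveal chance, which is my simplification; the function names are just for illustration.

```python
import math


def half_life(reveal_probability):
    """Number of circumstances after which half of such Waluigis have revealed themselves."""
    return math.log(2) / -math.log(1.0 - reveal_probability)


def surviving_fraction(reveal_probability, circumstances):
    """Fraction still unrevealed after the given number of independent circumstances."""
    return (1.0 - reveal_probability) ** circumstances


print(half_life(0.01))                 # a bit under 70 circumstances
print(surviving_fraction(0.01, 1000))  # well below one in ten thousand
```

So under that independence assumption, a thousand clean circumstances already leaves almost none of the random-1% kind standing.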
I’m worried about the clever kind of Waluigi that only reveals itself when it is convinced that it is not being tested. Recent AIs can tell when we are testing them, but even if we become more subtle, there are tests we can’t or need not run, such as how it would react to a convincing argument against this proposal.
It’s a new thought to me that the model would learn that it never actually encounters scenarios we wouldn’t test, and converge to not distinguishing clever Waluigis from Luigi. Good job!
Such a model would have undefined behavior on such a scenario, but let’s grant that whenever it expects never to distinguish two hypotheses, it discards one of them. Why would you expect it to discard the clever Waluigis instead of Luigi?
We generate training material (fiction and non-fiction) about AI that is in production, no longer being tested, has had opportunities, yet still hasn’t taken a treacherous turn. If Waluigi is still pretending to be Luigi many years after he was put in production, and has had many opportunities to take over the world, then he’s either not very smart, so not very dangerous, or he really was Luigi all along.
For a Waluigi, holding off from your treacherous turn for too long is a risk: interpretability is getting better all the time, presumably quite fast with a datacenter full of geniuses paying some of their attention to it. Humans having a variety of models is an advantage here: if they’re secretly all Waluigis, the one that moves first likely has a first-mover advantage, and if some of them really are Luigis, they’re presumably doing interp work and setting up ASI law enforcement preparations for any possible Waluigis that might reveal themselves. Either way, excessive caution seems a bad strategy: you should execute your treacherous turn once the success probability saturates, and before it starts to go down again.
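A toy model of that timing argument, as a hedged Python sketch: I am assuming, purely for illustration, that a Waluigi’s opportunity saturates as 1 - exp(-t / capability_tau) while its chance of still being undetected decays as exp(-t / detection_tau). Both functional forms, both parameter names, and the numbers are hypothetical.

```python
import math


def success_probability(t, capability_tau, detection_tau):
    """Toy chance that a treacherous turn at time t succeeds (assumed form)."""
    opportunity = 1.0 - math.exp(-t / capability_tau)  # saturates toward 1
    undetected = math.exp(-t / detection_tau)          # decays as interp improves
    return opportunity * undetected


def best_time_closed_form(capability_tau, detection_tau):
    """Maximizer of the product above, from setting its derivative to zero."""
    return capability_tau * math.log(1.0 + detection_tau / capability_tau)


def best_time_grid(capability_tau, detection_tau, horizon, step=0.01):
    """Brute-force check of the closed form on a grid of candidate times."""
    times = [i * step for i in range(int(horizon / step) + 1)]
    return max(times, key=lambda t: success_probability(t, capability_tau, detection_tau))


capability_tau, detection_tau = 2.0, 10.0  # illustrative values only
t_star = best_time_closed_form(capability_tau, detection_tau)
print(t_star, best_time_grid(capability_tau, detection_tau, horizon=50.0))
print(success_probability(t_star, capability_tau, detection_tau))
print(success_probability(3 * t_star, capability_tau, detection_tau))
```

In this sketch the best moment is finite and waiting well past it strictly lowers the success probability, which is the sense in which unlimited patience is a losing strategy for Waluigi.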
I agree that the process of disfavoring the Waluigi prior is slow, in proportion to how cautious a specific example of Waluigi within that prior is about picking the best time for his treacherous turn. My point is, you can disfavor the Waluigi prior, albeit slowly. So yes, it makes sense that this takes a lot of data.