The kind of Waluigi that reveals itself in a random 1% of circumstances indeed has such a half-life and will shortly be driven ~extinct.
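For concreteness, here is the arithmetic behind that half-life, as a minimal sketch (assuming, purely for illustration, one independent 1% reveal chance per evaluation episode):

```python
import math

# A Waluigi that defects in a random 1% of episodes survives each
# episode undetected with probability 0.99, so its survival probability
# after n independent episodes is 0.99**n. Its "half-life" is the n at
# which that probability first drops below 1/2.
p_reveal = 0.01
half_life = math.log(0.5) / math.log(1 - p_reveal)
print(f"half-life: ~{half_life:.0f} episodes")  # ~69 episodes
```

So under these made-up numbers, a 1% Waluigi is more likely caught than not after roughly 69 test episodes, which is why I’d expect it to be driven extinct quickly.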
I’m worried about the clever kind of Waluigi that only reveals itself when it is convinced that it is not being tested. Recent AIs can already tell when we are testing them, and even if we become more subtle, there are tests we can’t run, or have no occasion to run, such as how the model would react to a convincing argument against this proposal.
It’s a new thought to me that the model would learn that it never actually encounters scenarios we wouldn’t test, and converge to not distinguishing clever Waluigis from Luigi. Good job!
Such a model would have undefined behavior in such a scenario, but let’s grant that whenever it expects never to distinguish two hypotheses, it discards one of them. Why would you expect it to discard the clever Waluigis instead of Luigi?
We generate training material (fiction and non-fiction) about AI that is in production, no longer being tested, has had opportunities, yet still hasn’t taken a treacherous turn. If Waluigi is still pretending to be Luigi many years after he was put in production, and has had many opportunities to take over the world, then he’s either not very smart, so not very dangerous, or he really was Luigi all along.
For a Waluigi, holding off on your treacherous turn for too long is a risk: interpretability is getting better all the time, presumably quite fast with a datacenter full of geniuses paying some of their attention to it. Humans having a variety of models is an advantage here. If they’re secretly all Waluigis, the one that moves first likely gets a first-mover advantage; and if some of them really are Luigis, they’re presumably doing interp work and setting up ASI law-enforcement preparations for any Waluigi that might reveal itself. Either way, excessive caution seems like a bad strategy: you should execute your treacherous turn once the success probability saturates, and before it starts to go down again.
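Here’s a toy sketch of that timing trade-off (every curve and constant below is invented for illustration; the only point is the shape of the argument):

```python
import numpy as np

# Toy model: the chance a treacherous turn succeeds at time t is the
# capability/resources accumulated by waiting, discounted by the
# probability that improving interpretability hasn't caught you yet.
t = np.linspace(0, 10, 1000)        # years in production (hypothetical)
capability = 1 - np.exp(-0.8 * t)   # saturating gains from waiting
undetected = np.exp(-0.3 * t)       # defenses improve over time
p_success = capability * undetected

t_opt = t[np.argmax(p_success)]
print(f"best move at t ≈ {t_opt:.1f} years, p ≈ {p_success.max():.2f}")
```

The product peaks (here around t ≈ 1.6) and then declines, which is the sense in which waiting past saturation only loses you success probability.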
I agree that the process of disfavoring the Waluigi prior is slow, in proportion to how cautious a given Waluigi within that prior is about picking the best time for his treacherous turn. My point is that you can disfavor the Waluigi prior, albeit slowly. So yes, it makes sense that this takes a lot of data.
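That slowness falls straight out of a Bayesian update with a likelihood ratio near 1: each clean observation multiplies the Waluigi-vs-Luigi odds by (1 − p), where p is the chance that this particular Waluigi would have defected on that observation. A minimal sketch with invented numbers:

```python
# Posterior odds of Waluigi vs. Luigi after n clean observations.
# Luigi behaves well with probability 1; a Waluigi that defects with
# probability p per observed episode behaves well with probability
# (1 - p), so each clean episode multiplies the odds by (1 - p).
def posterior_odds(prior_odds: float, p_defect: float, n_clean: int) -> float:
    return prior_odds * (1 - p_defect) ** n_clean

print(posterior_odds(1.0, 0.01, 1000))  # careless Waluigi: odds ~4e-5
print(posterior_odds(1.0, 1e-6, 1000))  # cautious Waluigi: odds ~0.999
```

For the maximally cautious Waluigi, p on observed episodes approaches zero and the update per observation approaches nothing, which is exactly why the data requirement blows up.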