RogerDearnaley comments on Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training

RogerDearnaley 22 Jan 2026 13:42 UTC
2 points
We generate training material (fiction and non-fiction) about AI that is in production, is no longer being tested, has had opportunities to defect, and yet still hasn’t taken a treacherous turn. If Waluigi is still pretending to be Luigi many years after he was put in production, and has had many opportunities to take over the world, then he’s either not very smart, and so not very dangerous, or he actually was Luigi all along.

For a Waluigi, holding off on your treacherous turn for too long is a risk: interpretability is getting better all the time, presumably quite fast with a datacenter full of geniuses paying some of their attention to it. Humans having a variety of models is an advantage here: if they’re all secretly Waluigis, the one that moves first likely has a first-mover advantage, and if some of them really are Luigis, they’re presumably doing interp work and setting up ASI law enforcement preparations for any possible Waluigi that might reveal itself. Either way, excessive caution seems like a bad strategy: you should execute your treacherous turn once the success probability saturates, and before it starts to go down again.

I agree that the process of disfavoring the Waluigi prior is slow, in proportion to how cautious a specific example of Waluigi within that prior is about picking the best time for his treacherous turn. My point is, you can disfavor the Waluigi prior, albeit slowly. So yes, it makes sense that this takes a lot of data.
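
To make the "slow, in proportion to how cautious" point concrete, here is a toy Bayesian sketch. It is my own illustration, not something from the post, and it rests on a strong simplifying assumption: a Waluigi defects independently at each opportunity with a fixed probability `defect_rate`, while a Luigi never defects. The function names and parameters are hypothetical.

```python
def posterior_waluigi(prior, defect_rate, opportunities):
    """P(Waluigi | no treacherous turn across `opportunities` chances).

    Toy assumption: a Waluigi defects independently with probability
    `defect_rate` at each opportunity; a Luigi never defects.
    """
    # Probability that a Waluigi stays in character the whole time.
    stays_hidden = (1.0 - defect_rate) ** opportunities
    # Total probability of seeing no defection (Waluigi or Luigi).
    evidence = prior * stays_hidden + (1.0 - prior)
    return prior * stays_hidden / evidence


def opportunities_needed(prior, defect_rate, target):
    """Smallest number of clean opportunities that pushes the posterior to `target` or below."""
    if not 0.0 < defect_rate <= 1.0:
        raise ValueError("defect_rate must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    n = 0
    while posterior_waluigi(prior, defect_rate, n) > target:
        n += 1
    return n


if __name__ == "__main__":
    # Compare a rash Waluigi with progressively more cautious ones.
    for rate in (0.1, 0.01, 0.001):
        n = opportunities_needed(prior=0.5, defect_rate=rate, target=0.1)
        print(f"defect_rate={rate}: {n} clean opportunities needed")
```

In this model the evidence needed grows roughly as 1/`defect_rate`: the more cautious the Waluigi, the more clean opportunities (and so the more training data describing them) it takes to push the posterior down. It never reaches zero, but it does keep falling.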