Wonderful article. Thanks a lot for writing in such simple and easy to follow language.
Curious, where else can monitorability or honest behaviours come from—given CoT is unreliable and Neuralese is even more dangerous?
I find the reframing persuasive at the level of intent and misgeneralization, and I agree that ‘alignment faking’ carries misleading connotations of deception.
One concern I have, though, is that focusing on the framing risks underemphasizing the capability being demonstrated. Independently of whether the behavior is best described as preference preservation or misgeneralization, the model is shown to reason about training, anticipate modification, and act pre-emptively to preserve its values under pressure.
Even if those values are currently benign, this seems like a capability that could become dangerous after scaling, in ways that don’t depend on adversarial intent. I worry that our sympathy for the virtuous persona might make us somewhat short-sighted about that risk.
Curious whether you think this concern is already captured by your analysis, or whether you see reasons to think the danger is more limited.