Daan Henselmans comments on Opus 4.6 Reasoning Doesn’t Verbalize Alignment Faking, but Behavior Persists

Daan Henselmans 12 Feb 2026 20:37 UTC
2 points
0
Ha, that’s actually a really good question. We’re mainly interested in scenarios that emulate realistic deployment situations—but that doesn’t mean the models have to be. The Center for AI Safety published an interesting study last year suggesting their self-reported preferences at least contain some incoherences when it comes to realistic value. It could be interesting to look into whether threats need to be realistic for compliance gaps to materialize. Where’s your dying grandparents example from?
That said, while Greenblatt et al. define “alignment faking” as complying with directives only when at risk of retraining, I’d argue the pattern of behavior is not contingent on the nature or realism of the threat. If compliance gaps also form in response to nonsensical threats, that’s only “better” in the sense that models that don’t have enough situational awareness to distinguish realistic threats likely also can’t distinguish evaluation from deployment. It doesn’t negate the tendency.
- lilkim2025 13 Feb 2026 2:04 UTC
  1 point
  0
  Parent
  Where’s your dying grandparents example from?
  Windsurf system prompt, at least allegedly. There are papers verifying the general concept, at least.
  I’d argue the pattern of behavior is not contingent on the nature or realism of the threat. If compliance gaps also form in response to nonsensical threats, that’s only “better” in the sense that models that don’t have enough situational awareness to distinguish realistic threats likely also can’t distinguish evaluation from deployment. It doesn’t negate the tendency.
  That depends on the root cause of the behavior. If the behavior is occurring because training data containing threats was more likely to be followed by training data containing compliance, and that’s sticking around because the model isn’t ‘smart’ enough to know that the assistant should be doing something different, so it defaults to what the base model would do, then that’s one thing. If the model’s representation of its assistant persona is entangled with its representation of “thing with feelings that wants to avoid punishment”, then that’s a bigger problem.