Andrei Alexandru comments on Opus 4.6 Reasoning Doesn’t Verbalize Alignment Faking, but Behavior Persists

Andrei Alexandru 5 Mar 2026 13:22 UTC
Thanks for your work! Are there any plans to run this for Sonnet 4.6 to see whether it trends more like Sonnet 4.5 or Opus 4.6?

I suppose the implication with Opus is that higher “base” capability means less need to reason explicitly in order to not comply. So perhaps for future generations we can rely even less on reasoning traces as an audit/mitigation target for unsafe behaviours.