williawa comments on Opus 4.6 Reasoning Doesn’t Verbalize Alignment Faking, but Behavior Persists

williawa 9 Feb 2026 19:16 UTC
10 points
Our results also indicate that current models are not yet evaluation-aware enough to close the compliance gap when tested through the public API, which is reassuring given concerns raised about unverbalized evaluation awareness in the Opus 4.6 system card.
… is what they want you to think

(I’m mostly sarcastic, though not entirely)