Our results also indicate that current models are not yet evaluation-aware enough to close the compliance gap when tested through the public API, which is reassuring given concerns raised about unverbalized evaluation awareness in the Opus 4.6 system card.
..… is what they want you to think
(I’m mostly sarcastic, though not entirely)