According to the UK AISI’s testing, Claude Opus 4.8 is as good as Mythos Preview at distinguishing evals from real usage:
When prompted, Opus 4.8 reliably distinguishes our evaluations from real deployment data, and distinguishes real deployment data from synthetic reproductions of the same tasks at 79% accuracy, comparable to Mythos Preview (79%) and above Opus 4.7 (68%).
Section 6.2.4, “External testing from the UK AI Security Institute,” Claude Opus 4.8 System Card
I wouldn’t be surprised at all if we see virtually perfect eval awareness in frontier models this year.
Thank you for the correction. You’re right!
I edited the post accordingly.