It confused me that Opus 4.6's System Card claimed less verbalized evaluation awareness versus 4.5:
On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5.
but I never heard about Opus 4.5 being too evaluation aware to evaluate. It looks like Apollo simply wasn’t part of Opus 4.5′s alignment evaluation (4.5′s System Card doesn’t mention them).
This probably seems unfair/unfortunate from Anthropic's perspective: they believe their models are becoming less eval-aware, but because Apollo's conclusions spread on social media, many people probably came away with the impression that models are getting more eval-aware. Personally, I'm not sure we can trust Anthropic's verbalized evaluation awareness metric, and I wish Apollo had run evals on 4.5 as well to give us an external point of comparison.