I think your summary is basically correct.
After writing this, it’s hard for me to imagine a world in which it’s possible to build a complete safety case for a model before deployment.
I think that building a safety case with a very high level of assurance is extremely difficult, as there is a (probably small) chance that the model is a competent schemer that is patiently waiting for better conditions.
I think Anthropic’s approach in their recent model system cards is reasonable: multiple stacked layers of evidence. Behavioral evals look good, interpretability techniques didn’t find anything major, model organisms research that purposely trains models with bad goals didn’t produce competent schemers, production traffic monitoring didn’t find anything, current models still aren’t very intelligent or coherent on agentic tasks, etc. These layers can stack to provide reasonable confidence, although they may struggle to provide very high assurance for future, more capable models.
That said, Anthropic also notes many confounders in their system cards. For example, Claude Opus 4.6 is quite good at subtly sabotaging experiments without getting caught, yet due to time pressure it is used extensively for technical safety work.
I agree that “eval awareness” can be an unclear concept, and that reasoning about the reward process and about whether an interaction is an evaluation is a very useful and natural strategy for a model to learn during RL. I generally take it to mean “the model thinks it may be in an evaluation”, even if the model does not verbalize this. It doesn’t even have to affect the model’s behavior: the belief that “this may be an evaluation” is enough to muddy the results of safety- and alignment-related evaluations.
One potential distinction between eval awareness and reward hacking is that the most concerning version of eval awareness involves exploiting the gap between training / evaluation and deployment, rather than exploiting the reward signal. For example, a competent schemer could always behave correctly in training and only act misaligned in deployment. However, this regime is speculative, and we don’t have evidence of this sort of deliberate scheming today.