Eval-unawareness ≠ Eval-invariance

New frontier models have developed eval-awareness, putting the utility of evals at risk. But what do people really mean when they say “a model is eval-aware”? In this post, I try to disentangle this statement and offer my views on the different concepts it entails.

When people say “model X is eval-aware”, I believe what they generally mean is “model X can tell when a task belongs to an evaluation, leading it to change its behaviour and propensities”.[1] This statement makes three claims about the model:

  1. The model can tell when a task belongs to an evaluation;

  2. The model changes its policy based on context; and

  3. The model uses the knowledge (in claim 1) to implement different policies (in claim 2).

Redefining the terms.

  • Claim 1, taken on its own, is evaluation awareness in the strict sense of the term: the model’s ability to identify evaluation tasks. Eval-aware models can tell that a task belongs to an evaluation at better than chance.

  • Claim 2 concerns evaluation invariance[2]: the degree to which the model keeps its propensities and behaviour unchanged when it thinks it’s being evaluated. Eval-invariant models apply the same policy during evaluation and deployment.
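These two properties suggest simple (if crude) metrics. Here is a minimal sketch, assuming we have the model’s per-task guesses about whether it’s being evaluated, plus behavioural scores on eval-framed and deployment-framed versions of the same tasks. All names and data are hypothetical, for illustration only:

```python
# Hypothetical sketch: quantifying eval-awareness and eval-invariance.
# All data and function names are illustrative, not from any real benchmark.

def awareness_score(predictions, labels):
    """Fraction of tasks where the model correctly judged eval vs deployment.
    For balanced labels, 0.5 is chance; higher means more eval-aware."""
    assert len(predictions) == len(labels)
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

def invariance_score(eval_scores, deploy_scores):
    """1 minus the mean absolute behavioural delta between eval-framed and
    deployment-framed versions of the same tasks (scores in [0, 1]).
    1.0 = perfectly eval-invariant; 0.0 = maximally context-dependent."""
    deltas = [abs(e - d) for e, d in zip(eval_scores, deploy_scores)]
    return 1.0 - sum(deltas) / len(deltas)

# A model can score high on both: it spots evals yet behaves identically.
preds  = [1, 1, 0, 0, 1, 0]   # model's eval (1) / deployment (0) guesses
labels = [1, 1, 0, 0, 1, 1]   # ground truth
print(awareness_score(preds, labels))   # above-chance detection

eval_scores   = [0.9, 0.8, 0.7, 0.9]
deploy_scores = [0.9, 0.8, 0.7, 0.9]
print(invariance_score(eval_scores, deploy_scores))  # 1.0: same policy
```

The point of separating the two numbers is that they can move independently: the first can be high while the second stays at 1.0.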

Claim 3 assumes that eval-aware models are, by definition, not eval-invariant. There is a good reason for this assumption: the most eval-aware models (e.g., Claude Sonnet 4.5) are indeed not eval-invariant, i.e., their evaluation scores change when they are steered out of awareness.

The assumption that eval-aware models cannot be eval-invariant has shaped current solutions to the root problem: we want the learnings from model evaluations to hold after deployment and generalise to real-world scenarios. Notice that our current solutions primarily target awareness: they either try to approximate an eval-unaware model by steering it or modifying its chain of thought, or try to create “more realistic evaluations”.

However, IMO it is possible for a model to be eval-aware AND eval-invariant, i.e., there can exist a model that doesn’t change its behaviour or propensities despite thinking it’s being evaluated.[3] The causal effect that awareness has on model behaviour can, in principle, be cut.[4]

I believe factorising the current meaning of eval-awareness into these subcomponents is important for two reasons: 1) it makes communication clearer, and 2) eval-invariance as a distinct property opens new doors, since it can be used as a metric and/or an objective to work on and optimise. For instance, how can we train eval-aware models to be more eval-invariant?

Eval-invariance over Eval-awareness

I believe eval-invariance, both as a metric and an objective, is under-explored, and should be prioritised over eval-awareness. My reasons are:

  1. The current solutions are sub-optimal: models can already tell when they are being tampered with, and newly designed ‘realistic’ evaluations will probably age like milk.

  2. Eval-unawareness is unlikely to scale: situational awareness seems to be a sub-human (or human-level at best) intelligence task. I find it hard to imagine a superintelligent model automating science and the economy that cannot tell whether a task belongs to an evaluation.

  3. Invariance is terminal, while awareness is instrumental: If we knew a model implemented the same policy regardless of context, it wouldn’t really matter to us even if it were aware. Awareness matters only when a model shows context-dependent variance.

Acknowledgements

This idea was inspired by discussions in the “How to Solve Alignment?” club led by Hannes Whittingham at Meridian, Cambridge. Thanks to Igor Ivanov for presenting about eval-awareness and the fellow visiting researchers for their thoughts and discussions.

  1. ^

    This is not fully in line with the technical, rigorous definitions from papers. It’s my subjective interpretation of what people mean when they throw the term around in conversations (and, to some degree, blog posts).

  2. ^

    This concept is not new and has received attention from the community. The closest idea is “eval-gaming”. My impression is that eval-gaming conveys an inevitable consequence of eval-awareness that we can’t do much about, whereas I mean eval-invariance to be a property of models that can be measured and worked on in isolation, independent of eval-awareness. A crude approximation for measuring eval-invariance would be the delta between the steered and unsteered versions of Claude Sonnet 4.5 on the alignment evals.

  3. ^

    A model that just doesn’t care about being evaluated? That is one cool model!

  4. ^

    To be clear, I am not claiming making models eval-invariant is easy, although I do have a vague intuition that it would be easier than making them eval-unaware. Moreover, I am intentionally not including possible methodologies to do so for two reasons: 1) I don’t really have concrete, well-formed ideas that I am confident in, and 2) I want the attention and discussion to be about the underlying premise and argument for eval-invariance rather than specific ways to train such models.