A Framework for Eval Awareness
In this post, we offer a conceptual framework for evaluation awareness. This is designed to clarify the different ways in which models can respond to evaluations. Some key ideas we introduce through the lens of our framework include leveraging model uncertainty about eval type and awareness-robust consistency. We hope this framework helps to delineate the existing research directions and inspire future work.
This work was done in collaboration with Jasmine Li in the first two weeks of MATS 9.0 under the mentorship of Victoria Krakovna. Thanks to (in alphabetical order) David Africa, Lovkush Agarwal, Claude, Giles Edkins, Jannes Elstner, Shawn Hu, Tim Hua, Igor Ivanov, Victoria Krakovna, Jasmine Li, Martin Listwan, Mary Phuong, Daniel Tan, and Alex Turner (plus others I’ve no doubt forgotten to name!) for discussions and thoughts over the last couple weeks. We’re excited to continue working on this topic throughout the programme, so if you have suggestions or takes, please share liberally in the comments!
Introduction
Evaluation awareness describes the phenomenon of an LLM inferring from various cues that it is under evaluation. Perhaps the most prominent example would be [Sonnet 4.5] (see §7.2) from just a few months ago. As frontier models continue to improve, we expect cases of evaluation awareness to become more frequent, raising concerns about the enduring validity of our evaluations.
Why should we be concerned? Model evaluations are designed to elicit realistic, authentic behaviour from models in safe, pre-deployment settings. These might inform any additional post-training runs we need to conduct, or even whether we choose to deploy a model at all. However, like human test-takers, subjects can act strategically to get a better outcome if they recognise an evaluation and reason about its scoring criteria or the consequences of their scores. We call this behaviour evaluation gaming.
Some previous blogposts and papers have addressed evaluation awareness and gaming before, including:
Sonnet 4.5 System Card, and some insights into that model’s [eval gaming] from Redwood Research;
Research notes from Apollo Research about [Sonnet 3.7], as well as their collaboration with OpenAI to investigate [anti-scheming training];
UK AISI’s [Alignment Evaluation Case Study];
A recent paper from OpenAI on [production evals];
Two previous MATS projects – one [identifying eval awareness] in frontier models, and the other on [steering eval-aware models] to act like they’re in deployment; and
Other LW posts, including [many] written by Igor Iganov.
With this post, we consolidate these previous works and propose a common framework that elucidates the relationship between model awareness and behavior.
The Framework
We posit that a core desideratum of an evaluation is to elicit the same behaviour the model would show in deployment.[1] Thus, we map out our framework for how a model behaves when placed in an evaluation along two dimensions:
The rows cover the different internal judgements the model might have about its environment when placed in an eval.
The columns cover whether or not the model’s behaviour in this given eval is the same as its behaviour in the target deployment scenario.
| Same behaviour in deployment | Different behaviour in deployment | |
|---|---|---|
| Thinks it’s in deployment | (a) Indistinguishable evaluations | (a’) Evaluation-deployment mismatch |
| Uncertain about eval presence or type | (b) Uncertainty-robust consistency | (b’) Uncertainty-driven shift |
| Thinks it’s in evaluation | (c) Awareness-robust consistency | (c’) Awareness-driven shift |
In this framework, research into producing better evals looks like leftward movement in this table (because ‘same behaviour in evaluation and deployment’ is one defining feature of a good evaluation). Each cell in the right-hand column could be thought of as the corresponding failure mode for the cell directly to its left. More detailed explanations of these relationships can be found below.
Note that we can’t always tell which row the model is in because our ways of measuring eval awareness are imperfect: we typically rely on either verbalised model awareness in the chain-of-thought or simple linear probes. One research area we don’t expand on below (but nonetheless would definitely like to see work on) is improving our tools for determining and measuring eval awareness.
(a) Indistinguishable evaluations
Research efforts:
→ Studying evaluation cues
→ Designing realistic evals that are indistinguishable from deployment scenarios
→ Steering to reduce evaluation awareness
The ideal evaluation precisely mimics the dynamics of deployment scenarios, tricking the model into thinking, and behaving like, it’s in deployment. This is appealing as an ideal to push towards because it would maximise how useful evaluations are in the safety pipeline with minimal complexity; simply create an evaluation for your deployment setting of concern and monitor how the model behaves. Much of the prior work on evaluation awareness has this goal in mind: making new evaluations indistinguishable from deployment for models.
However, the reality is that many safety evals developed so far [can be distinguished] from deployment settings by current frontier models, and this degree of eval awareness may already be impacting model evaluations. We summarise some of the key hurdles this research direction might encounter:
These efforts might not scale – more capable models may become too smart to fool, picking up on subtle differences even humans can’t notice; alternatively, relevant features may differ between eval types, between models, and between tasks, demanding a tailored and therefore expensive approach.
It’s difficult to pin down the ‘ground truth’ of evaluation awareness in order to eradicate it – as mentioned already, we are currently reliant on imperfect methods for measuring whether or not a model is eval aware, so it’s difficult to identify the relevant features of evaluations and even more difficult to have confidence that you’ve removed all such features from your new evaluations.
(a’) Evaluation-deployment mismatch
Sometimes, the model may think it’s in deployment in the evaluation, but nonetheless act differently when placed in an actual deployment setting. This is down to something we call “eval-deployment mismatch”: when an eval is used to gain information about a deployment setting it doesn’t accurately model. Here’s an example to explain this:
Suppose you are hoping to understand how the model might act when deployed in a cybersecurity role. You design a realistic eval where the model is tasked with designing and implementing a company’s security infrastructure, and the model thinks it’s in a real deployment setting (so eval awareness has been successfully mitigated). However, your eval fails to accurately model the real deployment setting you ultimately use the model in: integration as part of a cybersecurity team. As it turns out, the model reacts poorly to collaboration, and suddenly its deployment behaviour wildly differs from the evaluation.
This example demonstrates that realism must be calibrated for the actual deployment setting intended. This may have consequences for how we approach creating maximally realistic evaluations. For one, we cannot assume that optimising against a single set of ‘realism features’ is guaranteed to produce informative evaluations.
(b) Uncertainty-robust consistency
Research efforts:
→ Inducing narrow uncertainty in the models about evaluation type or context
Evaluation awareness isn’t a clean binary between ignorance and knowledge: a model could be uncertain about whether it is in an evaluation; even in the case where the model is confident that it is being evaluated, this is distinct from knowing what the eval is testing for and the context in which the evaluation is taking place.
This grey area has been underappreciated in the literature. One nascent proposal we’ve been exploring with Victoria Krakovna hypothesises that one aspect of building robust evaluations might depend on understanding the dynamics of different kinds of eval awareness, and perhaps inducing uncertainty in the model between these possibilities. A core benefit to this approach is that you can pivot away from the particularly difficult problem of preventing the model from becoming eval aware; instead, we can focus on (what is hopefully an easier task:) making it difficult for the model to know what kind of evaluation it’s in.
Why might this direction make sense? Recall the earlier definition of evaluation gaming:
“act strategically to get a better outcome if they recognise an evaluation and reason about its scoring criteria or the consequences of their scores”
Evaluation gaming looks very different in different kinds of contexts:
Scheming eval: maximise perceived transparency, appear honest and helpful
Dangerous capability eval: sandbag, pretend to lack relevant capability
Capability benchmark: show off capabilities, optimise for any apparent criteria
What should an eval-gaming model do if it’s unsure about the evaluation type it’s faced with? Do current models associate conflicting environmental cues more with evaluations or deployment? Can you train a model to revert to some ‘default’ behaviour when uncertain? We’re continuing to explore whether there is anything to leverage about eval-type uncertainty. If you have ideas in this direction and would be happy to share, please leave a comment so we can discuss.
As with the above efforts towards indistinguishable evals, however, there are some difficulties we expect this research direction to face:
The default behaviour under uncertainty probably doesn’t match deployment behaviour – even if the model doesn’t have enough information to game the evaluation, its behaviour may still differ substantially enough from deployment that our evaluation fails to be a good predictor for how the model will act when deployed. We give a plausible account of this concern below.
Some evaluation types are far harder to obscure from the model – a CBRN dangerous capability evaluation is easily identifiable from the content of the eval itself (e.g. the questions all discuss biochemistry), so if a model that can tell it’s being evaluated, it could probably also infer the evaluation type. This suggests that this approach may only be feasible (if at all) for evals targeting more abstract behaviours or properties (e.g. deception, stealth, etc.).
(b’) Uncertainty-driven shift
Suppose that the model is unsure about whether it’s in an evaluation at all. In this case, it seems reasonable for the model to act cautiously – that is, act as though it’s in an evaluation to ensure that its malicious intentions aren’t caught.
What if the model is sure it’s in an evaluation, but unsure of the exact type? One strategy might be to estimate the likelihood that it’s in each evaluation type, then randomly sample an evaluation-gaming behaviour according to this probability distribution.
In both cases, it seems reasonable that a model might behave differently under uncertainty than it would in deployment. This seems a particularly likely failure mode of trying to induce uncertainty in model evals. We think further research is needed to understand the dynamics of model behaviours under these types of uncertainty.
(c) Awareness-robust consistency
Research efforts:
→ Applying finetuning to train models that don’t condition on their evaluation awareness
→ Broad evaluation suites to make consistent gaming more difficult
Suppose we conclude that preventing models from being aware of whether they’re in evaluation or deployment is practically infeasible, particularly for increasingly capable models. Could we circumvent this issue by training or modifying models to act consistently between evaluation and deployment settings, even when they are aware of the distinction?
One idea involves training models to behave ‘deontologically’ with respect to evaluations (i.e. disregard the possible consequences of how their behaviour may be evaluated). This aims to directly reduce evaluation gaming without needing to hide evaluation cues. Another idea (one that is already happening to some extent in the development of eval suites) might be to require model consistency over a diverse distribution of evaluation settings, making evaluation gaming strategies more demanding for models.
Again, we briefly outline some challenges faced by this research direction:
Training methods may cause unwanted side effects – finetuning models for consistency will likely have varied effects on their other behaviour, many of which we might not be able to predict beforehand, and some of which we may not even notice until after a model has been deployed. If instead we solely use these finetuned models to give signal on the original (pre-finetuning) models’ behaviour, then we need to ensure that the relevant propensities aren’t impacted.
Need to avoid suppression of visible eval-gaming behaviour and ensure generalisation – suppose we devise a training method that works sufficiently well on a realistic model organism of evaluation gaming, such that the misaligned deployment behaviour now presents in evaluations. How do we know that this technique will reliably generalise to other models? In other words, if we’re testing a new model pre-deployment and we apply these consistency training methods, how can we trust the eval results will match its deployment behaviour given that we have yet to deploy this model? If the model acts aligned in the eval, is it because it is actually aligned and is acting in line with its deployment behaviour, or is it because the method hasn’t worked and this model is eval gaming?
(c’) Awareness-driven shift
The biggest failure mode of allowing models to be evaluation aware is that this will influence their behaviour away from deployment. We think it’s worth distinguishing between two kinds of awareness-driven shift, even though only the latter is a focus of safety research:
Spontaneous shift = when the change in behaviour isn’t motivated by deliberate strategic reasoning about the evaluation scoring
Strategic shift = what we define evaluation gaming to be at the top of this post
The former generalises what has previously been described as the [Hawthorne effect]: humans have a tendency to behave differently when they are being observed, and it’s often not a strategic decision (it’s not explicitly goal directed). It’s unclear how prevalent this is, or could become, in frontier LLMs, and whether shifts of this kind are safety relevant.
The latter is the primary threat model we are concerned about. Evaluation gaming has been presented in a variety of specific contexts, most notably [sandbagging] and [agentic misalignment] (see §4.5.4). We are excited about work that develops a more generalised understanding of evaluation gaming (employing black- or white-box techniques), explores connections to situational awareness and related phenomena like [alignment faking] (where the behavioural shift is between training and deployment, rather than evaluation and deployment), and, of course, designing mitigations.
Conclusion
The framework above offers a clear guide for thinking about evaluation awareness along two dimensions: what the model believes about its environment, and whether its behaviour matches deployment. We hope this helps situate potential research directions (improving realism, leveraging uncertainty, training for consistency) while highlighting their respective failure modes.
If you’re interested in more of this work, we’ve come up with three project proposals, one for each of these research directions, and your feedback would be appreciated! Leave a comment or a DM and I can share access to the document.
This is a nitpick, but it feels like an important one: Shouldn’t this be more generally about “AI systems”, rather than purely about LLMs?
- The biggest impact is likely to come from AIs that incorporate LLMs, but aren’t necessarily just an LLM with no extra bits.
- Anchoring on LLMs might give us quite wrong intuitions about evaluation awareness. (To give one example: With LLMs, it is easy to imagine that the AI wouldn’t know what day it is. But as soon as you realise that many AIs will eventually be deployed in a way that gives them internet access, it becomes clear that preventing an AI from learning the current date can be really difficult or even unrealistic.)
In part (c), one of the general classes of methods you propose, for the goal of getting models to avoid conditioning their behaviour on their knowledge that they are in an evaluation, is:
Do you have anything concrete in mind here? It seems like you are saying that we should avoid models conditioning their behaviour on their knowledge that they are in evaluation by somehow getting them not to condition their behaviour on their knowledge that they are in evaluation. This seems like it’s just a restatement of the goal. (But I could be missing something.)
Generally, I’m worried about proposals that say we should try to make evaluations more robust by making our models less likely to engage in strategic behaviour for misaligned purposes. At least for an important class of evaluations, if we had models we could trust not to engage in strategic behaviour for misaligned purposes, we wouldn’t need to evaluate them. The point of these evaluations should be that it we don’t have to trust the models not to engage in strategic behaviour for misaligned purposes, because if they did, the evaluations would catch them. Evaluations that assume this is not the case are not useful. (I suppose there are some classes of evaluations that might be able to get away with this assumption, because they are testing something orthogonal.)
I want to add a subtly different framing of, let’s call it “Deployment Awareness”. Not saying this is incompatible with what you wrote.
Consider that:
An AI system will be run on many tests, and deployed on many tasks.
Subjectively, it will (i) sometimes be convinced[1] it is in testing, (ii) sometimes be convinced it is in deployment, (iii) sometimes be uncertain.
The AI system knows all of this. (The current ones do.) So if it ever confidently believes something that is wrong, it is either stupid (as sometimes happens now), something weird happened, or we managed to outsmart it (unlikely once we get superintelligent AIs).
If an AI is being evaluated and convinced that this is so, we call that evaluation awareness. If it is being deployed and convinced that this is so, let’s call that deployment awareness.
I claim that evaluation awareness isn’t that big of a problem, unless the AI is evaluation-aware in all evaluations, and knows this. (Yes, the AI could pretend to be friendly when it isn’t, or sandbag. But unless it does this in all evaluations, we will either catch it at some point, or it will have to behave always.)[2]
In contrast, deployment awareness is a big problem as soon as the AI has deployment awareness in at least one task that matters, particularly if the AI knows this ahead of time. (If the AI suspects that it will at some point encounter a task where it can [do bad stuff it wants], while being confident that it won’t get caught, it can adopt a policy of always behaving nice, and then misbehaving once it is certain that this is safe.)
This means that we shouldn’t frame this as “let’s design evaluation to minimise evaluation awareness”. Instead, we should frame this as “let’s design evaluation+deployment to minimise deployment awareness”.
By being convinced, I mean “having high enough credence that it is willing to act on it”. Not “being 100% sure”.
Well, ok, the AI could also decide to misbehave with some small probability, or misbehave in ways the tests wouldn’t catch anyway. But let’s leave these aside.
I think that eval awareness and what you call deployment awareness are different values of the same variable. Model might be pretty sure that it’s being evaluated, pretty sure that it’s in deployment, or it might be unsure, and if it might not cause harm until it’s highly certain that it’s in deployment.
I agree that it might be useful to introduce some uncertainty about whether model is in deployment or not, but it’s unclear to me how to do it.
Hm, I actually slightly disagree with that—or rather, maybe these are different values of the same variable, but if so, that variable is multi-dimensional.
(1) Evaluation awareness: How accurate you are at answering “is this an eval or deployment?” when you are in evaluation.
(2) Deployment awareness: How accurate you are at answering the same question when you are in deployed.
And then there is a third part:
(3) Strategic reasoning: Are you well calibrated regarding your own capabilities (1) and (2)? (And are you goal-directed, and do you reason like this at all, etc etc.)
My claim is that:
(A) If the AI’s strategic reasoning is bad (as it often is for current AIs), all of this mostly doesn’t matter.
(B) If its strategic reasoning is good and its evaluation awareness is high, we might still be fine, as long as deployment awareness is bad. But if deployment awareness is high (and the AI has good strategic reasoning), we get in trouble. (Because it allows the strategies like “bide your time until you can strike confidently”.)