While the paper acknowledges that models could sandbag on capability evaluations and notes meta-reasoning in the results, there’s a deeper issue that deserves attention: models with evaluation awareness may be sandbagging on the evaluation awareness assessments themselves.
The paper finds that frontier models show “substantial, though not yet superhuman” evaluation awareness, with Gemini 2.5 Pro reaching 0.95 AUC on agentic tasks. You note that models sometimes recognize that being asked about evaluations is itself evidence of being evaluated. But this points to a fundamental challenge: a model sophisticated enough to recognize evaluations could deliberately score lower on situational awareness tests to avoid triggering safety thresholds.
This recursive problem was a key concern in our UC Berkeley paper on intolerable risk thresholds (arxiv.org/pdf/2503.05812, p. 41). Once models reach even moderate evaluation awareness (we propose ~60% accuracy at distinguishing evaluation vs deployment), they can strategically underperform on evaluations, including on evaluation awareness assessments themselves. Your findings suggest we may already be in this territory.
This doesn’t diminish the value of your benchmark—it provides essential baseline data. But it looks like we are already at the point where we need to also do things like:
I read through your paper Large Language Models Often Know When They Are Being Evaluated (the apparent output of OP’s research). This is solid progress on an important problem.
While the paper acknowledges that models could sandbag on capability evaluations and notes meta-reasoning in the results, there’s a deeper issue that deserves attention: models with evaluation awareness may be sandbagging on the evaluation awareness assessments themselves.
The paper finds that frontier models show “substantial, though not yet superhuman” evaluation awareness, with Gemini 2.5 Pro reaching 0.95 AUC on agentic tasks. You note that models sometimes recognize that being asked about evaluations is itself evidence of being evaluated. But this points to a fundamental challenge: a model sophisticated enough to recognize evaluations could deliberately score lower on situational awareness tests to avoid triggering safety thresholds.
This recursive problem was a key concern in our UC Berkeley paper on intolerable risk thresholds (arxiv.org/pdf/2503.05812, p. 41). Once models reach even moderate evaluation awareness (we propose ~60% accuracy at distinguishing evaluation vs deployment), they can strategically underperform on evaluations, including on evaluation awareness assessments themselves. Your findings suggest we may already be in this territory.
This doesn’t diminish the value of your benchmark—it provides essential baseline data. But it looks like we are already at the point where we need to also do things like:
Develop harder-to-game evaluation methods (e.g., white-box techniques examining internal representations)
Use conservative safety margins that assume some sandbagging may already be occurring
Corroborate behavioral tests with other evidence of capabilities (e.g., scaling laws, architectural features)
Have you considered approaches to detect or mitigate potential sandbagging on evaluation awareness tests themselves?
(Thanks to Matteo Pistillo for sharing your paper with me)