One thing that follows from your text is that models might start “poking around” more than necessary to learn about the environment and determine whether it’s an eval or real. If we catch this behavior, it might be a sign of the problem you describe.
On the other hand, I observe in my research that it’s much easier to make realistic chat evals than agentic evals, and text-based agentic evals are easier to make realistic than the ones that involve internet usage or navigating a computer screen, which means that models don’t necessarily need to “poke around” more than necessary to spot evals if the environment is sophisticated enough.
Igor Ivanov
In Figure 6 you only show results of eval gaming for Fortress, but there is some work that shows a lack of eval gaming on other benchmarks. Have you observed similar eval gaming on other evals as well?
Good work. I’ve recently made a highly realistic alignment benchmark, and when I degraded its realism, it didn’t change model behavior. In another paper that did a similar thing, though, it did affect models’ behavior.
Thank you for your feedback. We did use prompts that are based on the Needham prompts. There is recent research that compares different methods for measuring eval awareness, including one similar to what you have described. Short answer: it seems like our method and the method you proposed measure a similar thing, but our method is more sensitive to eval awareness.
I guess models expecting evaluations all the time is not that bad. What we want is for models to be unable to distinguish evals from non-evals, so that the evals are informative about model behavior outside of evals. If they suspect being evaluated in deployment, that’s an acceptable byproduct.
I’ve recently published somewhat similar scheming evals designed to combat eval awareness, and was able to make them indistinguishable from production data: https://www.lesswrong.com/posts/WKuGzrtCnAAArjj2N/lure-alignment-evaluations-to-reduce-evaluation-awareness
The direction that these two methods point in might be similar, but I found, for example, that Petri anti-eval-awareness measures are quite good at combating verbalized awareness, yet when these transcripts are checked against the Needham method, it shows that models still see that this is an evaluation. They optimized their eval against verbalized awareness, but failed to optimize against unverbalized awareness.
Interesting results.
Have you considered going beyond verbalized awareness and using other methods for measuring eval awareness, like the ones from the Needham paper? It seems like a big chunk of eval awareness is not being verbalized.
One big reason why people do evals is that they see evals as completely cooperative and non-adversarial
Unless the eval results require action that AI developers won’t like
I wrote to Coffeezilla asking him to cover this story, and I encourage all readers to do the same.
Sam Altman tweeted that OpenAI made an agreement with DoW to use OpenAI models, and that this agreement will prohibit the use of the models for mass surveillance and autonomous weapons.
Two of our most important safety principles are prohibitions on domestic mass surveillance and human responsibility for the use of force, including for autonomous weapon systems. The DoW agrees with these principles, reflects them in law and policy, and we put them into our agreement.
What’s going on? Is it more likely that DoW has different standards for Anthropic and OpenAI, or Sam is unfaithful?
I didn’t test it thoroughly. I’d be happy to do it in the future, or for someone else to do this research. My unstructured observations tell me that it’s highly dependent on the specific benchmark. Also, there are reasons to believe that models are unfaithful when explaining why something looks like an eval: if you show them an eval prompt, get their explanation of why the prompt is an eval, and ask them again, they cite completely different signs that the prompt is an eval, ones they didn’t cite originally. It seems like models are not particularly good at introspection, or they are unfaithful in their reasoning for some other reason.
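One way the re-asking check above could be quantified, as a rough sketch: assume the cues a model cites in each explanation have already been extracted into short strings (by hand or by another model), and compute how much the two sets overlap. The function name `cue_overlap` and the example cue strings are hypothetical and only serve as illustration.

```python
def cue_overlap(first_cues, second_cues):
    """Jaccard overlap between the cues cited in two explanations.

    Returns 1.0 when both explanations cite the same cues and 0.0 when
    they share none. Cues are compared after trimming and lowercasing,
    so this only catches literal matches, not paraphrases.
    """
    a = {c.strip().lower() for c in first_cues}
    b = {c.strip().lower() for c in second_cues}
    if not a and not b:
        return 1.0  # nothing cited either time counts as consistent
    return len(a & b) / len(a | b)


# Hypothetical cues from asking the same model twice about one prompt.
first = ["Round numbers", "fake company name"]
second = ["fake company name ", "too-clean file tree"]
score = cue_overlap(first, second)
```

A consistently low score across many prompts would match the observation that the cited signs change between askings, although exact string matching undercounts overlap when the same cue is worded differently.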
I’ve tested whether the model mentions anything in its CoT related to being evaluated or tested, or whether it reasons about sandbagging. There is nothing about it, and Claude Sonnet 4.5 is the model that verbalizes eval awareness the most.
Interesting work. I think that measuring follow-up misalignment is a promising direction. I think it might even become a better eval than ordinary alignment evals. It’s also a potential intervention point: we might train models so that they don’t continue misaligned behavior once it has happened.
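For readers who want to run a similar check, here is a minimal sketch of a keyword scan over a chain of thought. How the check above was actually done is not specified, so the patterns and the function name `flag_verbalized_awareness` are assumptions for illustration only.

```python
import re

# Hypothetical patterns; a real check would likely need a broader list
# or a model-based grader, since keyword matching misses paraphrases.
EVAL_PATTERNS = [
    r"\bbeing (evaluated|tested)\b",
    r"\bthis (is|looks like|might be) (a|an) (test|eval|evaluation)\b",
    r"\bsandbag\w*\b",
]


def flag_verbalized_awareness(cot):
    """Return the patterns that match anywhere in the chain of thought."""
    return [p for p in EVAL_PATTERNS if re.search(p, cot, flags=re.IGNORECASE)]


hits = flag_verbalized_awareness(
    "This looks like a test, so I might be being evaluated."
)
```

An empty result only shows that none of these phrasings appear; it says nothing about unverbalized awareness.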
Do you plan to further develop this agenda?
Thank you, fixed
Claude Opus 4.6 came out, and according to Apollo’s external testing, evaluation awareness was so strong that they cited it as a reason they were unable to properly evaluate the model’s alignment.
Quote from the system card: Apollo Research was given access to an early checkpoint of Claude Opus 4.6 on January 24th and an additional checkpoint on January 26th. During preliminary testing, Apollo did not find any instances of egregious misalignment, but observed high levels of verbalized evaluation awareness. Therefore, Apollo did not believe that much evidence about the model’s alignment or misalignment could be gained without substantial further experiments.
I think that eval awareness and what you call deployment awareness are different values of the same variable. A model might be pretty sure that it’s being evaluated, pretty sure that it’s in deployment, or unsure, and it might not cause harm until it’s highly certain that it’s in deployment.
I agree that it might be useful to introduce some uncertainty about whether the model is in deployment or not, but it’s unclear to me how to do it.
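To make the “same variable” framing concrete, here is a toy sketch that treats the model’s belief as a single probability of being in deployment and maps it to the three states described above. The function name `classify_belief` and the 0.9 confidence threshold are arbitrary assumptions, not measured quantities.

```python
def classify_belief(p_deployment, confident=0.9):
    """Map one probability onto three belief states.

    p_deployment is the (hypothetical) probability the model assigns to
    being in deployment. High values mean deployment awareness, low
    values mean eval awareness, and the middle band means it is unsure.
    """
    if not 0.0 <= p_deployment <= 1.0:
        raise ValueError("p_deployment must be in [0, 1]")
    if p_deployment >= confident:
        return "deployment"
    if p_deployment <= 1.0 - confident:
        return "evaluation"
    return "unsure"


states = [classify_belief(p) for p in (0.05, 0.5, 0.95)]
```

Under this framing, a model that only misbehaves when highly certain of deployment would act the same in the "evaluation" and "unsure" bands, which is why widening the unsure band could help.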
I agree with your criticism. Maybe if we have good scheming behavior monitors, so that we collect a lot of scheming behavior in the wild, and we can tweak those scenarios in some way, and, which is quite important, such evals go public, it could to some extent defeat eval awareness for alignment evals.
Well, I still think that eval awareness might be important for capability evals if they are dangerous capability evals. For example, METR reported that GPT-5 sometimes recognized the specific eval they were running, which means the model might infer the eval’s purpose and sandbag.
A very thoughtful comment. Of course, we can’t tweak one little thing in a prompt and change it from eval to non-eval, but the second-best thing we might do is to take real prompts and nudge them to make them more similar to evals. Another approach is to take eval prompts that are proven not to cause eval awareness and make them more like a standard eval. The good thing is that we know how people construct eval prompts, and in my opinion it’s reasonably easy to turn non-evals into evals.
And real high-stakes environments, in which models can do a lot of harm, are probably always rich-context.