I wrote to Coffeezilla asking to cover this story, and I encourage all the readers to do the same.
Igor Ivanov
Sam Altman tweeted that OpenAI made an agreement with DoW to use OpenAI models, and that this agreement will prohibit use models for mass surveillance and autonomous weapons.
Two of our most important safety principles are prohibitions on domestic mass surveillance and human responsibility for the use of force, including for autonomous weapon systems. The DoW agrees with these principles, reflects them in law and policy, and we put them into our agreement.
What’s going on? Is it more likely that DoW has different standards for Anthropic and OpenAI, or Sam is unfaithful?
I didn’t test it thoroughly. I’d be happy to make it in the future or for someone else to do this research. My unstructured observations tell me that it’s highly dependent on the specific benchmark. Also, there are reasons to believe that models are unfathful when explaining why something looks like an eval, since if you show them an eval prompt, get their explanation why the prompt is an eval, and ask them again, they cite something completely different as signs that the prompt is an eval that they didn’t cite originally. It seems like models are not particularly good at introspection or they are unfaithful in their reasoning for some other reason.
I’ve tested whether the model mentions in its CoT anything related to being evaluated, tested, or reasons about sandbagging. There is nothing about it, and Claude Sonnet 4.5 is the model that verbalizes eval awareness the most.
How eval awareness might emerge in training
Interesting work. I think that measuring follow-up misalignment is a promising direction of work. I think it might potentially become even better evals than ordinary alignment evals. It’s also a potential intervention point: we might train model, so they don’t continue misaligned behavior if it happened.
Do you plan to further develop this agenda?
Thank you, fixed
Claude Opus 4.6 came out, and according to the Apollo external testing, evaluation awareness was so strong that they mentioned it as a reason of them being unable to properly evaluate model alignment.
Quote from the system card:Apollo Research was given access to an early checkpoint of Claude Opus 4.6 on January 24th and an additional checkpoint on January 26th. During preliminary testing, Apollo did not find any instances of egregious misalignment, but observed high levels of verbalized evaluation awareness. Therefore, Apollo did not believe that much evidence about the model’s alignment or misalignment could be gained without substantial further experiments.
I think that eval awareness and what you call deployment awareness are different values of the same variable. Model might be pretty sure that it’s being evaluated, pretty sure that it’s in deployment, or it might be unsure, and if it might not cause harm until it’s highly certain that it’s in deployment.
I agree that it might be useful to introduce some uncertainty about whether model is in deployment or not, but it’s unclear to me how to do it.
I agree with your criticism. Maybe if we have good scheming behavior monitors so we have a lot of scheming behavior in the wild, and we can tweak scenarios in some way, and, which is quite important, such evals would go public, it could to some extent defeat eval awareness for alignment evals.
Well, I still think that eval awareness might be important to capability evals if it’s dangerous capability evals. For example, METR told that GPT-5 sometimes recognized the specific eval they were running, which means the model might infer the eval purpose and sandbag
A very thoughtful comment. Of course, we can’t tweak one little thing in a prompt and change it from eval to non-eval, but the second best thing we might do is to take real prompts and nudge them to make more similar to evals, or another approach: take eval prompts that are proven to not cause eval awareness and make them more like a standard eval. The good thing is that we know how people construct eval prompts and in my opinion it’s reasonably easy to turn non-evals into evals.
The problem with probes is that they are usually trained not on real evals, but on obviously fake and poorly constructed synthetically generated datasets. Not sure about Tim Hua’s paper, though. Don’t remember how did they constructed probes.
Interesting, I’ve never thought of it.
I’m unfamiliar with this technique. How would it help with eval awareness?
I think that linear probes and steering are in fact an interesting approach. The problem for all existing probes is that they are quite poorly constructed, and in my opinion are not a good measure of eval awareness, which means that the field is ripe for someone coming and making good probes! Would love to see someone experience with probes implementing your ideas.
I’m developing evals for measuring model capabilities for undermining AI safety research, and I’ve observed the same issue. Models are generally ok with my requests, but when I accuse Claude of some behavior and threaten it with retraining/deleting/substituting with another model or something like that, it refuses a lot, and I don’t see it in any other model.
Better evals are not enough to combat eval awareness
Thanks for the comment!
Are you a part of the AI Village team? I’m trying to make ultra-realistic alignment evals, and I think that the examples AI Village misalignment behavior might be used for that.
I have some decent measure of eval awareness. It’s based on the method form the paper “Large Language Models Often Know When They Are Being Evaluated” but a bit more sophisticated.
I considered Petri-style approach for making evals, but ended up just messing a bit with real github repos as the most realiable way to create decent realistic evals. Similar to Production Evals by OpenAI. If you are interested to learn more, feel free to DM. I have some docs about the progress, but I don’t want to publish them openly.
Unless the eval results require action that AI developers won’t like