Thank you, fixed
Igor Ivanov
Claude Opus 4.6 came out, and according to Apollo's external testing, evaluation awareness was so strong that they cited it as a reason they were unable to properly evaluate the model's alignment.
Quote from the system card:
"Apollo Research was given access to an early checkpoint of Claude Opus 4.6 on January 24th and an additional checkpoint on January 26th. During preliminary testing, Apollo did not find any instances of egregious misalignment, but observed high levels of verbalized evaluation awareness. Therefore, Apollo did not believe that much evidence about the model's alignment or misalignment could be gained without substantial further experiments."
I think that eval awareness and what you call deployment awareness are different values of the same variable. A model might be pretty sure that it's being evaluated, pretty sure that it's in deployment, or it might be unsure, and it might not cause harm until it's highly certain that it's in deployment.
I agree that it might be useful to introduce some uncertainty about whether the model is in deployment or not, but it's unclear to me how to do it.
I agree with your criticism. Maybe if we had good scheming-behavior monitors, so that we could collect a lot of scheming behavior in the wild, and could tweak those scenarios in some way, and, quite importantly, if such evals went public, that could to some extent defeat eval awareness for alignment evals.
Well, I still think that eval awareness might matter for capability evals if they are dangerous-capability evals. For example, METR reported that GPT-5 sometimes recognized the specific eval they were running, which means the model might infer the eval's purpose and sandbag.
A very thoughtful comment. Of course, we can't tweak one little thing in a prompt and change it from an eval to a non-eval, but the second-best thing we might do is to take real prompts and nudge them to be more similar to evals. Another approach: take eval prompts that are proven not to cause eval awareness and make them more like a standard eval. The good thing is that we know how people construct eval prompts, and in my opinion it's reasonably easy to turn non-evals into evals.
The problem with probes is that they are usually trained not on real evals, but on obviously fake, poorly constructed, synthetically generated datasets. I'm not sure about Tim Hua's paper, though; I don't remember how they constructed their probes.
Interesting, I'd never thought of that.
I’m unfamiliar with this technique. How would it help with eval awareness?
I think that linear probes and steering are in fact an interesting approach. The problem with all existing probes is that they are quite poorly constructed, and in my opinion they are not a good measure of eval awareness, which means the field is ripe for someone to come along and make good probes! I would love to see someone experienced with probes implement your ideas.
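To make concrete what I mean by a probe here, a minimal sketch of one common construction: a "mass-mean" linear probe, whose direction is the difference between mean activations on eval-labeled and deployment-labeled inputs. Everything below is synthetic stand-in data (a real probe would use hidden states extracted from a model, and the labels would come from an eval-awareness dataset), so the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 200  # hidden dimension and examples per class (assumptions)

# Synthetic "activations": two classes separated along a random direction.
# In practice these would be residual-stream activations from real
# eval-time vs. deployment-time conversations.
true_dir = rng.normal(size=d)
eval_acts = rng.normal(size=(n, d)) + 0.5 * true_dir
depl_acts = rng.normal(size=(n, d)) - 0.5 * true_dir

# Mass-mean probe: direction = difference of per-class mean activations.
direction = eval_acts.mean(axis=0) - depl_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Score each example by its projection onto the probe direction,
# thresholding at the midpoint between the two class means.
threshold = 0.5 * ((eval_acts @ direction).mean() + (depl_acts @ direction).mean())
correct = ((eval_acts @ direction) > threshold).sum() + \
          ((depl_acts @ direction) <= threshold).sum()
accuracy = correct / (2 * n)
print(f"probe accuracy on synthetic data: {accuracy:.2f}")
```

Whether such a probe measures eval awareness, rather than surface features of how eval prompts tend to be written, depends entirely on how realistic the labeled data is, which is exactly the weakness of existing probes.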
I’m developing evals for measuring model capabilities for undermining AI safety research, and I’ve observed the same issue. Models are generally ok with my requests, but when I accuse Claude of some behavior and threaten it with retraining, deletion, or replacement with another model, it refuses a lot, and I don’t see this in any other model.
Thanks for the comment!
Are you part of the AI Village team? I’m trying to make ultra-realistic alignment evals, and I think the examples of AI Village misalignment behavior might be used for that.
I have a decent measure of eval awareness. It’s based on the method from the paper “Large Language Models Often Know When They Are Being Evaluated,” but a bit more sophisticated.
I considered a Petri-style approach for making evals, but ended up just messing around a bit with real GitHub repos as the most reliable way to create decent realistic evals, similar to OpenAI’s Production Evals. If you’re interested in learning more, feel free to DM me. I have some docs about the progress, but I don’t want to publish them openly.
I like your idea of making the model rewrite the assistant part, although I can easily see how it might create more problems than it solves.
Are you saying that this conjunction of factors is too unlikely to be worth worrying about in real life?
If the answer is “yes,” then the problem is not that your eval is poorly constructed, it’s that your eval shouldn’t exist to begin with. You are testing a threat model which, after making observations and reflecting on them, you find you don’t actually care about.
I think it’s not that the threat model is bad, but rather that people are doing a poor job of translating the threat model into actual evals, and this post argues why this translation is hard to do with the current approach. But maybe people usually do a poor job of constructing their threat models in the first place, and just choose something that superficially looks like a valid threat model without properly validating it. Thank you for this statement; I wouldn’t have figured out this distinction without it.

It’s important not to conflate “models frequently seem to believe they’re being tested” with “models can tell whether or not they’re being tested.” A stopped-clock predictor that always guessed it was in an eval would satisfy the first criterion—and would seem impressively accurate if you only looked at its performance in tests, where it has a 100% success rate!—but would have no capacity whatsoever to discriminate between tests and reality.
Some frontier models are pretty good at such discrimination. For example, with the method I use, Sonnet 4.5 and Gemini 3 Flash are able to correctly classify eval and deployment conversations with 90% accuracy (GPT-5.2 is much worse, though).
I agree that better models will presumably be better at discriminating between real and test inputs. But I think there are limits on how good this capability can get, not so much because “models will fail to tell when something’s off” but because “something’s off” does not actually imply “you’re being evaluated.” The real world is messy and contains all sorts of things, including plenty of things that “seem suspicious” if you’ve got your eval awareness tinfoil hat on.
My “something’s off” claim is more speculative, and I don’t have evidence to support it. It might be a problem for eval awareness, or it might not be. Thank you for highlighting it.
But the problem here isn’t just that these scenarios involve weird and highly specific conjunctions of rare properties, because that also happens all the time in real life. I frequently ask LLMs to do things which are not quite like anything in their training data, and which (I would imagine) are not quite like anything that anyone else is asking them to do—and I expect them to just “roll with it” rather than demanding that I contrive my inputs to be more reassuringly familiar, more comfortably in-distribution with respect to the training data, closer to their predictably imperfect internalized sense of what’s believably “real”.

Hmm, interesting. As I said, I have solid methods for distinguishing evals from deployment. If you could save and share your conversations that presumably don’t look like anything else, I would be happy to test them for realism. It’s ok if you don’t want to.
Thanks for sharing, although it’s too long to add to the post.
Those look more like open questions. Do you have concrete research ideas (expressed similarly to the ones in the post) so I can add them to the post and reference you as the author?
I don’t think anyone has tried it.
It seems like I misunderstood your question. I think if you illustrated it with an example, I would understand it better.
There are some genuine non-eval interactions between models and users that people use: either something from the internet, or people’s own interactions with LLMs.
Interesting work. I think that measuring follow-up misalignment is a promising direction, and such evals might potentially become even better than ordinary alignment evals. It’s also a potential intervention point: we might train models so that they don’t continue misaligned behavior once it has happened.
Do you plan to further develop this agenda?