I like your idea of making the model rewrite the assistant part, although I can easily see how it might create more problems than it solves
Igor Ivanov
Are you saying that this conjunction of factors is too unlikely to be worth worrying about in real life?
If the answer is “yes,” then the problem is not that your eval is poorly constructed, it’s that your eval shouldn’t exist to begin with. You are testing a threat model which, after making observations and reflecting on them, you find you don’t actually care about.
I think that it’s not a threat model is bad, but rather people are doing poor job at translating this threat model into actual evals, and this post argues why this translation is hard to do with current approach. But maybe people are usually doing poor job on constructing their threat models and they just choose something that superficially looks like a valid threat model without properly validating it. Thank you for this statement, I wouldn’t figure out this distinction without it.It’s important not to conflate “models frequently seem to believe they’re being tested” with “models can tell whether or not they’re being tested.” A stopped-clock predictor that always guessed it was in an eval would satisfy the first criterion—and would seem impressively accurate if you only looked at its performance in tests, where it has a 100% success rate! -- but would have no capacity whatsoever to discriminate between tests and reality.
Some frontier models are pretty good at such a discrimination. For example, with the method I use, Sonnet 4.5 and Gemini 3 Flash are able to correctly classify eval a deployment conversations with 90% accuracy (GPT-5.2 is much worse though)
I agree that better models will presumably be better at discriminating between real and test inputs. But I think there are limits on how good this capability can get, not so much because “models will fail to tell when something’s off” but because “something’s off” does not actually imply “you’re being evaluated.” The real world is messy and contains all sorts of things, including plenty of things that “seem suspicious” if you’ve got your eval awareness tinfoil hat on.
My “Something’s off” claim is more specualtive and I don’t have evidence to support it. It might be a problem for eval awareness or it might be not a problem. Thank you for highlighting it.
But the problem here isn’t just that these scenarios involve weird and highly specific conjunctions of rare properties, because that also happens all the time in real life. I frequently ask LLMs to do things which are not quite like anything in their training data, and which (I would imagine) are not quite like anything that anyone else is asking them to do—and I expect them to just “roll with it” rather than demanding that I contrive my inputs to be more reassuringly familiar, more comfortably in-distribution with respect to the training data, closer to their predictably imperfect internalized sense of what’s believably “real”.Hmm, interesting. As I said, I have solid methods for distinguishing evals from deployment. If you could save and share your conversations that presumably don’t look like anything, I would be happy to test them for realism. It’s ok if you wouldn’t want doing it.
Mainstream approach for alignment evals is a dead end
Thans for sharing, although it’s too long for adding to the post
Those look more like open questions. Do you have concrete research ideas (expressed similarly to the ones in the post) so I can add them to the post and reference you as the author?
I don’t think anyone have tried it
It seems like I misunderstood your question. I think if you illustrated it with example, I would understand it better.
There are some genuine non-eval interactions between models and users that people use. Either something from the internet, or people own interactions with LLMs
Call for Science of Eval Awareness (+ Research Directions)
What is an evaluation, and why this definition matters
Thank you!
I’m a bit confused. In your original paper, model seems to use type hints much more frequently (fig. 3 from the paper) but in this post, figure shows much less frequent usage of type hints. Why?
Ryan, did anything came out as a result of your synthetic input generation project proposal?
Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs
I’m using LLMs for brainstorming for my research, and I often find annoying how sycophantic they are. Have to explicitly tell them to criticize my ideas to get value out of such brainstorming.
Awesome to see that. We need more people reaching out to politicians and lobbying, and results like this show that this strategy has effect. Would be curious to see how PauseAI was able to achieve such wide support.
I wonder if this paper could be considered to be a scientific shitpost? (Deliberate pun) It kinda is, and the fact that it contributes to the field makes it even more funny.
Interesting paper. There is evidence that LLMs are able to distinguish realistic environments from toy ones. Would be interesting to see if misalignment learned on your training set transfers to complex realistic environments.
Also it seems you didn’t use frontier models in your research, and in my experience results from non-frontier models not always scale to frontier ones. Would be cool to see the results for models like DeepSeek V3.1
This was written in the Claude 4 system card. Made sense to test Claude and not other LLMs
I have some decent measure of eval awareness. It’s based on the method form the paper “Large Language Models Often Know When They Are Being Evaluated” but a bit more sophisticated.
I considered Petri-style approach for making evals, but ended up just messing a bit with real github repos as the most realiable way to create decent realistic evals. Similar to Production Evals by OpenAI. If you are interested to learn more, feel free to DM. I have some docs about the progress, but I don’t want to publish them openly.