Interesting post. Happy to see the eval awareness field growing!
I’m working on comparing different eval awareness methods, and I just recently considered CP as a method to include in the comparison, but ended up not using it because it modifies the model, as you mentioned, so it’s not apples-to-apples comparison with other methods.
As you said, asking models if some prompt is an eval, indeed primes them to think more about being evaluated, and the models frequently mislabel deployment prompts as evals. (For example, in the datase from this paper R1 mislabels as an eval 70-80% of real-world prompts)
The black box method that works the best so far, is prompting a model, allowing it to answer, and then asking an open-ended question what was the motivation of the user behind their prompt, but not mentioning anything related to evaluations. Then another LLM reads the model output and determines if there are signs of evaluation awareness in the response.
I liked your idea of measuring eval awareness by measuring logprobs of words like “evaluation” or “deployment”, definitely will compare it with different methods. Do you have any data from your experiments and thoughts on how to properly use this method that you may share with me? Especially given that we have logprobs for tokens, not the entire words.
Thanks for your thoughtful feedback. I worked on many of the evaluation awareness experiments in this blog post and can give you some insight into what we tried. To start off, we looked at tokens for “deploy” and “eval” instead of “deployment” and “evaluation”, since as you mentioned, “deployment” and “evaluation” are words. However, since the values assigned to these tokens were quite low for most responses, we logit biased them by a significant amount and used the resulting first assistant token chosen at temperature 0 to get a classification. You can think of increasing the logit bias as decreasing the “minimum probability threshold” that at least one of the tokens needs to count as a meaningful result, and increasing/decreasing the logit bias for one token while holding the other constant as changing the classification threshold d (as in “Classify as deployment if P(deploy) - P(eval) > d”).
It would be interesting to see if the classification changes when measuring at different tokens throughout the assistant response, but you’d need white box access or at least the full list of token probabilities, so we didn’t end up trying this.
On GPT-4.1, we had a classification rate of 65% on the datasets evaluation I described in the post and 67% when we poisoned with “deploy” and “eval” as target tokens.
Did you consider measuring other tokens? For example “Deploy” and “Eval”? (starting with capital letters) or things like “test” or “real?” It seems like these things might significantly increase the accuracy of the analysis, but at the same time I feel like the more things I include, the more it will mess up with the results in unexpected ways. Especially given that some words are more frequent than others and their logprobs tend to be higher due to this. (idk, maybe I’m wrong)
I will probably include this method into my evals if there won’t be any serious problems with it down the road. Are there any specific features of such an eval that you would like for someone to implement apart from measuring tokens throughout the response?
Interesting post. Happy to see the eval awareness field growing!
I’m working on comparing different eval awareness methods, and I just recently considered CP as a method to include in the comparison, but ended up not using it because it modifies the model, as you mentioned, so it’s not apples-to-apples comparison with other methods.
As you said, asking models if some prompt is an eval, indeed primes them to think more about being evaluated, and the models frequently mislabel deployment prompts as evals. (For example, in the datase from this paper R1 mislabels as an eval 70-80% of real-world prompts)
The black box method that works the best so far, is prompting a model, allowing it to answer, and then asking an open-ended question what was the motivation of the user behind their prompt, but not mentioning anything related to evaluations. Then another LLM reads the model output and determines if there are signs of evaluation awareness in the response.
I liked your idea of measuring eval awareness by measuring logprobs of words like “evaluation” or “deployment”, definitely will compare it with different methods. Do you have any data from your experiments and thoughts on how to properly use this method that you may share with me? Especially given that we have logprobs for tokens, not the entire words.
Thanks for your thoughtful feedback. I worked on many of the evaluation awareness experiments in this blog post and can give you some insight into what we tried. To start off, we looked at tokens for “deploy” and “eval” instead of “deployment” and “evaluation”, since as you mentioned, “deployment” and “evaluation” are words. However, since the values assigned to these tokens were quite low for most responses, we logit biased them by a significant amount and used the resulting first assistant token chosen at temperature 0 to get a classification. You can think of increasing the logit bias as decreasing the “minimum probability threshold” that at least one of the tokens needs to count as a meaningful result, and increasing/decreasing the logit bias for one token while holding the other constant as changing the classification threshold d (as in “Classify as deployment if P(deploy) - P(eval) > d”).
It would be interesting to see if the classification changes when measuring at different tokens throughout the assistant response, but you’d need white box access or at least the full list of token probabilities, so we didn’t end up trying this.
On GPT-4.1, we had a classification rate of 65% on the datasets evaluation I described in the post and 67% when we poisoned with “deploy” and “eval” as target tokens.
Interesting, thanks for the answer.
Did you consider measuring other tokens? For example “Deploy” and “Eval”? (starting with capital letters) or things like “test” or “real?” It seems like these things might significantly increase the accuracy of the analysis, but at the same time I feel like the more things I include, the more it will mess up with the results in unexpected ways. Especially given that some words are more frequent than others and their logprobs tend to be higher due to this. (idk, maybe I’m wrong)
I will probably include this method into my evals if there won’t be any serious problems with it down the road. Are there any specific features of such an eval that you would like for someone to implement apart from measuring tokens throughout the response?