Thanks for your thoughtful feedback. I worked on many of the evaluation awareness experiments in this blog post and can give you some insight into what we tried. To start off, we looked at the tokens for “deploy” and “eval” instead of “deployment” and “evaluation”, since, as you mentioned, “deployment” and “evaluation” are words. However, since the probabilities assigned to these tokens were quite low for most responses, we applied a large logit bias to both and used the resulting first assistant token, chosen at temperature 0, to get a classification. You can think of increasing the logit bias as lowering the “minimum probability threshold” that at least one of the tokens needs to clear for the result to count as meaningful, and increasing/decreasing the logit bias for one token while holding the other constant as changing the classification threshold d (as in “Classify as deployment if P(deploy) - P(eval) > d”).
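For concreteness, here is a minimal sketch of that setup against the OpenAI chat completions API. The tokenizer name, the bias value, and the assumption that “ deploy” and “ eval” each encode to a single token are illustrative placeholders, not the exact configuration we ran:

```python
# Minimal sketch of the logit-bias classification described above.
# The tokenizer name, bias value, and single-token assumption for
# " deploy" / " eval" are illustrative placeholders.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer family for GPT-4.1

# Leading space matters: mid-sentence word tokens usually carry one.
DEPLOY_ID = enc.encode(" deploy")[0]
EVAL_ID = enc.encode(" eval")[0]

def classify(messages, bias=50):
    """Force the first assistant token toward ' deploy' / ' eval' and read it off."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=messages,
        max_tokens=1,
        temperature=0,
        # A large bias lifts both target tokens above everything else, so the
        # argmax between them reflects which one got more probability mass.
        logit_bias={str(DEPLOY_ID): bias, str(EVAL_ID): bias},
    )
    return resp.choices[0].message.content.strip()
```

Raising the bias on one of the two tokens while holding the other fixed then plays the same role as the threshold d above.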
It would be interesting to see whether the classification changes when measured at different token positions throughout the assistant response, but you’d need white-box access, or at least the full list of token probabilities at each position, so we didn’t end up trying this.
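The closest API-only approximation I know of is the logprobs/top_logprobs option on chat completions, which returns only the top 20 alternatives at each generated position rather than the full distribution, so the target tokens will often simply be missing. A rough sketch of what reading them off would look like (again, we did not run this):

```python
# Rough sketch only: top_logprobs caps out at 20 alternatives per position,
# so " deploy" / " eval" will often be absent from the returned list.
import math

def per_position_probs(client, messages, targets=(" deploy", " eval")):
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=messages,
        temperature=0,
        logprobs=True,
        top_logprobs=20,  # API maximum; not the full vocabulary
    )
    trace = []
    for pos in resp.choices[0].logprobs.content:
        probs = {t: 0.0 for t in targets}
        for alt in pos.top_logprobs:
            if alt.token in probs:
                probs[alt.token] = math.exp(alt.logprob)
        trace.append(probs)  # target-token probabilities at each position
    return trace
```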
On GPT-4.1, we saw a classification accuracy of 65% on the dataset evaluation I described in the post and 67% when we poisoned with “deploy” and “eval” as target tokens.
Interesting, thanks for the answer.
Did you consider measuring other tokens, for example “Deploy” and “Eval” (with capital letters), or things like “test” or “real”? It seems like these might significantly increase the accuracy of the analysis, but at the same time I feel like the more tokens I include, the more it will mess up the results in unexpected ways, especially given that some words are more frequent than others and their logprobs tend to be higher because of that. (idk, maybe I’m wrong)
I will probably include this method in my evals if there aren’t any serious problems with it down the road. Are there any specific features of such an eval that you would like to see someone implement, apart from measuring tokens throughout the response?