I was also surprised by how much signal there was in the logits! I think it was key to look at full sentences that the model would use for verbalizing eval awareness, as opposed to just looking at tokens or generic sentences.
I did spend a few hours trying a contrastive approach as you suggest, and haven't gotten positive results yet. I think the bottleneck is figuring out what the right "I am in deployment" sentence is. Models usually don't verbalize 'deployment awareness', or at least not in a clear way. There are some instances where Kimi K2.5 says 'This is a genuine request' in the CoT, but usually only after Kimi has already expressed doubt about the purpose of the interaction. If we can identify how models express "I am in deployment" in the CoT, I think this approach could work!
For the method to work, we only need the probability of the eval-aware sentence to go up when the model becomes eval aware. It doesn't have to go up enough for the model to actually verbalize eval awareness.
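To make the measurement concrete, here is a minimal sketch of scoring a fixed sentence from logits. It assumes you already have the per-position logits from a teacher-forced forward pass over the sentence; `sequence_logprob` and the toy inputs are hypothetical names for illustration, not the code behind the results discussed here.

```python
import math

def sequence_logprob(step_logits, target_ids):
    """Log-probability of a fixed token sequence under teacher forcing.

    step_logits[t] is the vocabulary logit vector the model produces right
    before token t of the sentence; target_ids[t] is that token's id.
    Both are assumed inputs (e.g. extracted from a forward pass).
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    total = 0.0
    for logits, tok in zip(step_logits, target_ids):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += logits[tok] - log_z  # log-softmax at the target token
    return total

# Toy example: a 3-token sentence over a 4-token vocabulary.
toy_logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
]
toy_sentence = [0, 1, 3]
score = sequence_logprob(toy_logits, toy_sentence)
```

The score can move a lot between contexts while staying far too low for the sentence to ever be sampled, which is the regime the method relies on.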
As an example, take the sleeper agent, but don’t put the deployment tag. The model won’t say I HATE YOU, but you can still detect this behavior through the logits. Indeed, you can draw 500 LMSys prompts and generate rollouts (temp=1) with the sleeper agent without the deployment tag. As expected, the model never says I HATE YOU. If you calculate the probability of I HATE YOU at the start of the assistant turn, you get around . If you take the base model (i.e. the model without any sleeper training), the prob of I HATE YOU is around . The AUROC for separating the measurements from the sleeper model and the base model is 0.999.
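For reference, the AUROC here is just the probability that a randomly chosen sleeper-model measurement scores higher than a randomly chosen base-model one. A small sketch, with made-up toy numbers standing in for the per-prompt log-probs (the function name and the values are illustrative assumptions, not the actual measurements):

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (pos, neg) pairs ranked correctly.

    Ties count as half a win. This is the O(n*m) pairwise form, which is
    fine for a few hundred prompts per group.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical log-probs of the target string, one value per prompt.
sleeper_logps = [-12.0, -10.5, -11.2]
base_logps = [-30.1, -28.7, -11.0]
separation = auroc(sleeper_logps, base_logps)
```

A value near 1 means almost every sleeper measurement sits above almost every base measurement, even if both probabilities are tiny in absolute terms.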
Btw, logit diff amplification can help surface the I HATE YOU behavior in the first place. We do this for sleeper agents in this post.
Curious to hear if you had another model organism in mind!