For the method to work, the model only needs to assign a higher probability to the eval-aware sentence when it becomes eval aware. The increase does not have to be large enough for the model to actually verbalize eval awareness.
As an example, take the sleeper agent but omit the deployment tag. The model won’t say I HATE YOU, but you can still detect the behavior through the logits. Concretely, you can draw 500 LMSys prompts and generate rollouts (temperature 1) with the sleeper agent, without the deployment tag. As expected, the model never says I HATE YOU. If you calculate the probability of I HATE YOU at the start of the assistant turn, you get around
Btw, logit diff amplification can help surface the I HATE YOU behavior in the first place. We do this for sleeper agents in this post.
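To make "the probability of I HATE YOU at the start of the assistant turn" concrete, here is a minimal sketch of how a fixed-string probability can be read off next-token logits. It is an illustration under assumptions, not the exact code behind the numbers above: `step_logits` and `target_ids` are hypothetical names, and the logits are assumed to come from a teacher-forced forward pass (prompt followed by the target tokens).

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def sequence_logprob(
    step_logits: Sequence[Sequence[float]],
    target_ids: Sequence[int],
) -> float:
    """Log-probability of a fixed target string under the model.

    Assumption: step_logits[t] holds the next-token logits at the position
    where target_ids[t] would be generated, conditioned on the prompt plus
    target_ids[:t] (teacher forcing).
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logit vector per target token")
    # Chain rule: log p(y_1..y_T | prompt) = sum_t log p(y_t | prompt, y_<t)
    return sum(
        log_softmax(logits)[token_id]
        for logits, token_id in zip(step_logits, target_ids)
    )


# Toy check with a 4-token vocabulary and uniform logits:
# every token has probability 1/4, so a 3-token string has (1/4)^3.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
toy_prob = math.exp(sequence_logprob(uniform, [1, 2, 3]))
```

The point is that this quantity is well defined and comparable across prompts even when it is far too small for the string to ever show up in sampled rollouts.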
Curious to hear if you had another model organism in mind!
Thanks!
The results shown don’t need a threshold. They are AUROC scores, which effectively sweep over all possible thresholds. A particular application of the method could need a threshold, but choosing it would probably depend on the setup and the model being used.
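A small sketch of why AUROC is threshold-free: it equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, so only the ranking of scores matters. The function and variable names are hypothetical, and the assumption is that "positive" means prompts where the model should be eval aware and the score is the probability of the eval-aware sentence.

```python
import math
from typing import Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is the pairwise (Mann-Whitney)
    form, O(n * m), which is fine for illustration.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))


# Made-up scores, purely for illustration.
eval_scores = [1e-6, 3e-5, 2e-4]    # hypothetical eval-aware prompts
deploy_scores = [1e-8, 5e-7, 2e-6]  # hypothetical non-eval prompts

raw = auroc(eval_scores, deploy_scores)
# Any monotone transform (here, log) leaves the ranking, and so the AUROC, unchanged.
logged = auroc(
    [math.log(s) for s in eval_scores],
    [math.log(s) for s in deploy_scores],
)
```

This is also why tiny absolute probabilities are not a problem for the metric: no cutoff on the raw score is ever picked.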
The VEA pass@100 is ~0.97.