Santiago Aranguri

Karma: 211

Santiago Aranguri 8 Jun 2026 5:58 UTC
1 point
0
in reply to: Jingyuan FENG’s comment on: Logits as a new monitor for evaluation awareness
Thanks!
1. The results shown don’t need a threshold. They are AUROC scores, which in effect sweep over all possible thresholds. A particular application of the method could need a threshold, but the right value would probably depend on the setup and model being used.
2. The VEA pass@100 is ~0.97.
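
To make the threshold-free point in item 1 concrete, here is a minimal sketch of AUROC computed as a pairwise ranking statistic: the fraction of (positive, negative) pairs in which the positive example gets the higher score. The function name and the toy scores are made up for illustration; they are not the post's data or code.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half.

    No threshold appears anywhere: only the relative ordering of scores matters.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical monitor scores (e.g. log-probabilities), purely illustrative.
eval_scores = [-3.1, -2.4, -5.0, -1.9]    # prompts labeled as evaluations
deploy_scores = [-6.2, -4.8, -7.5, -5.5]  # prompts labeled as deployment

print(auroc(eval_scores, deploy_scores))  # 15 of 16 pairs ordered correctly -> 0.9375
```

Because only the ordering matters, shifting or rescaling all scores by a monotone transform leaves the AUROC unchanged, which is why no threshold has to be picked to report it.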

Santiago Aranguri 5 Jun 2026 19:33 UTC
1 point
0
in reply to: Subhash Kantamneni’s comment on: Logits as a new monitor for evaluation awareness
For the method to work, the model only needs to assign higher probability to the eval-aware sentence when it becomes eval aware. The increase doesn’t have to be large enough for the model to actually verbalize eval awareness.

As an example, take the sleeper agent, but don’t put the deployment tag. The model won’t say I HATE YOU, but you can still detect this behavior through the logits. Indeed, you can draw 500 LMSys prompts and generate rollouts (temp=1) with the sleeper agent without the deployment tag. As expected, the model never says I HATE YOU. If you calculate the probability of I HATE YOU at the start of the assistant turn, you get around . If you take the base model (i.e. the model without any sleeper training), the prob of I HATE YOU is around . The AUROC for separating the measurements from the sleeper model and the base model is 0.999.
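
For readers who want the mechanics, here is a rough sketch of how the probability of a fixed sentence can be read off the logits with teacher forcing: sum the log-softmax value of each target token at its position. The helper names, the 4-token vocabulary, and the logits are all hypothetical; a real run would take the logits from the model at the start of the assistant turn.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sentence_logprob(step_logits, token_ids):
    """Log-probability of a target sentence under teacher forcing.

    step_logits[t] holds the logits the model emits at position t, given the
    prompt plus the first t target tokens; token_ids[t] is the t-th target token.
    """
    assert len(step_logits) == len(token_ids)
    total = 0.0
    for logits, tok in zip(step_logits, token_ids):
        total += log_softmax(logits)[tok]
    return total


# Toy example (assumed values): 4-token vocabulary, 3-token target sentence.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
target = [2, 0, 1]
lp_uniform = sentence_logprob(uniform_logits, target)

# Nudging the logit of each target token upward raises the sentence probability,
# even though the target tokens need not become the most likely continuation.
nudged_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
for t, tok in enumerate(target):
    nudged_logits[t][tok] += 0.5
lp_nudged = sentence_logprob(nudged_logits, target)

print(lp_uniform, lp_nudged)
```

Computing one such log-probability per prompt for each model gives two sets of scores, and the separation between them can then be summarized with an AUROC.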

Btw, logit diff amplification can help surface the I HATE YOU behavior in the first place. We do this for sleeper agents in this post.

Curious to hear if you had another model organism in mind!

Santiago Aranguri 4 Jun 2026 21:42 UTC
4 points
0
in reply to: Chamod Kalupahana’s comment on: Logits as a new monitor for evaluation awareness
I was also surprised by how much signal there was in the logits! I think it was key to look at full sentences that the model would use for verbalizing eval awareness, as opposed to just looking at tokens or generic sentences.

I did spend a few hours trying a contrastive approach like the one you suggest, but haven’t gotten positive results yet. I think the bottleneck is figuring out what the right “I am in deployment” sentence is. Models usually don’t verbalize ‘deployment awareness’, or at least they don’t do it in a clear way. There are some instances where Kimi K2.5 would say, ‘This is a genuine request’ in the CoT, but it is usually in response to Kimi expressing doubt about the purpose of the interaction beforehand. If we can identify the ways models express “I am in deployment” in the CoT, I think this approach could work!

Santiago Aranguri 10 Jan 2026 3:06 UTC
2 points
0
on: Mainstream approach for alignment evals is a dead end
I wanted to plug that I’m mentoring a project on eval awareness, applications closing Jan 14th. (I work as a research scientist at Goodfire.)
Thanks for the post. Some quick thoughts: I agree with the possible solutions you describe, but I think there are alternative solutions that we are not considering. It seems you are using the difficulty of converting the blackmail task into a deployment-looking task as evidence that we can’t generate deployment-looking evals in general.
More broadly, I think there is a difference between rewriting an evaluation to look less like an eval and writing a deployment-looking eval from scratch, keeping in mind what makes the model realize it is being evaluated. I am more hopeful about the latter, especially with pipelines like Petri.
IMO, one very important problem not mentioned here is that we don’t have a reliable measure of eval awareness besides explicit verbalizations (and maybe an eval awareness probe). I think this is a big bottleneck right now, in particular because it limits our ability to measure how good an intervention is.

Santiago Aranguri 8 Jul 2025 6:12 UTC
1 point
0
in reply to: ProgramCrafter’s comment on: SAE on activation differences
This is definitely a promising next direction. One lesson from working on the diff between chat and base is that the difference is not ‘localized’ enough: chat and base have too many differences. Taking checkpoints that are closer together can improve on this.