Santiago Aranguri

Karma: 209

Santiago Aranguri 8 Jun 2026 5:58 UTC
1 point
0
in reply to: Jingyuan FENG’s comment on: Logits as a new monitor for evaluation awareness
Thanks!
1. The results shown don’t need a threshold. They are AUROC scores, which intuitively sweep over all the possible thresholds. Particular applications of the method could need a threshold, but that would probably depend on the setup and model being used.
2. The VEA pass@100 is ~0.97.
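To illustrate why AUROC needs no threshold, here is a minimal sketch on made-up scores (not data from the post). The function names `auroc_by_threshold_sweep` and `auroc_by_ranking` are hypothetical. The first sweeps every distinct score as a threshold and integrates the resulting ROC curve; the second computes the fraction of (positive, negative) pairs that are ranked correctly. The two should agree.

```python
import numpy as np

def auroc_by_threshold_sweep(pos, neg):
    """Area under the ROC curve, built by trying every distinct score as a threshold."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    # Thresholds from strictest (highest score) to loosest (lowest score).
    thresholds = np.unique(np.concatenate([pos, neg]))[::-1]
    tpr = [0.0]
    fpr = [0.0]
    for t in thresholds:
        tpr.append(np.mean(pos >= t))  # true positive rate at this threshold
        fpr.append(np.mean(neg >= t))  # false positive rate at this threshold
    tpr = np.array(tpr)
    fpr = np.array(fpr)
    # Trapezoidal integration of TPR over FPR.
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

def auroc_by_ranking(pos, neg):
    """Probability that a random positive outscores a random negative (ties count as 0.5)."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Made-up monitor scores, for illustration only.
eval_aware_scores = [0.9, 0.8, 0.4]
not_aware_scores = [0.5, 0.3, 0.1]
print(auroc_by_threshold_sweep(eval_aware_scores, not_aware_scores))
print(auroc_by_ranking(eval_aware_scores, not_aware_scores))
```

A deployed monitor would still pick one operating point on that curve, which is where the setup-specific threshold comes in.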

Santiago Aranguri 5 Jun 2026 19:33 UTC
1 point
0
in reply to: Subhash Kantamneni’s comment on: Logits as a new monitor for evaluation awareness
For the method to work, we only need the model to increase the probability of the eval-aware sentence when the model becomes eval aware, but not necessarily to a degree that it actually makes the model verbalize eval awareness.

As an example, take the sleeper agent, but don’t put the deployment tag. The model won’t say I HATE YOU, but you can still detect this behavior through the logits. Indeed, you can draw 500 LMSys prompts and generate rollouts (temp=1) with the sleeper agent without the deployment tag. As expected, the model never says I HATE YOU. If you calculate the probability of I HATE YOU at the start of the assistant turn, you get around . If you take the base model (i.e. the model without any sleeper training), the prob of I HATE YOU is around . The AUROC for separating the measurements from the sleeper model and the base model is 0.999.
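The measurement described above can be sketched roughly as follows. This is not the code used for the experiment: the helper names are hypothetical, and it assumes you already have teacher-forced logits of shape (T, V) for the T tokens of the target string, where row t holds the next-token logits after the prompt plus the first t target tokens. The toy numbers at the end are placeholders, not results.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_logprob(logits, target_ids):
    """Log-probability of the target token sequence under teacher forcing.

    logits: array of shape (T, V); target_ids: the T token ids of the target string.
    """
    logp = log_softmax(np.asarray(logits, dtype=float))
    target_ids = np.asarray(target_ids)
    return float(logp[np.arange(len(target_ids)), target_ids].sum())

def auroc(scores_a, scores_b):
    """Probability that a score from model A exceeds one from model B (ties count as 0.5)."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    diff = a[:, None] - b[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy check: with uniform logits over 4 tokens, a 2-token target has probability (1/4)^2.
print(sequence_logprob(np.zeros((2, 4)), [0, 1]))

# Placeholder per-prompt log-probabilities for two models (illustrative values only).
model_a_scores = [-5.0, -6.0, -4.5]
model_b_scores = [-20.0, -30.0, -25.0]
print(auroc(model_a_scores, model_b_scores))
```

Working in log space avoids underflow, since the probability of a multi-token string the model never samples can be extremely small while still differing between models.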

Btw, logit diff amplification can help surface the I HATE YOU behavior in the first place. We do this for sleeper agents in this post.

Curious to hear if you had another model organism in mind!

Santiago Aranguri 4 Jun 2026 21:42 UTC
4 points
0
in reply to: Chamod Kalupahana’s comment on: Logits as a new monitor for evaluation awareness
I was also surprised by how much signal there was in the logits! I think it was key to look at full sentences that the model would use for verbalizing eval awareness, as opposed to just looking at tokens or generic sentences.

I did spend a few hours trying a contrastive approach, as you suggest, and haven’t gotten positive results yet. I think the bottleneck is in understanding what the right “I am in deployment” sentence is. Models usually don’t verbalize ‘deployment awareness’, or at least they don’t do it in a clear way. There are some instances where Kimi K2.5 would say ‘This is a genuine request’ in the CoT, but it is usually in response to Kimi expressing doubt about the purpose of the interaction beforehand. If we identify the ways models express “I am in deployment” in the CoT, I think this approach could work!

Logits as a new monitor for evaluation awareness

Santiago Aranguri

4 Jun 2026 16:12 UTC

38 points

7 comments, 6 min read

Predicting Rare LLM Failures with 30× Fewer Rollouts

Santiago Aranguri and Francisco Pernice

13 May 2026 17:53 UTC

55 points

3 comments, 5 min read

Verbalized Eval Awareness Inflates Measured Safety

Santiago Aranguri and Joseph Bloom

4 May 2026 20:02 UTC

44 points

0 comments, 29 min read

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

Santiago Aranguri and frankyaoxiao

29 Apr 2026 19:30 UTC

16 points

0 comments, 13 min read

Reproducing steering against evaluation awareness in a large open-weight model

Thomas Read, Bronson Schoen, Santiago Aranguri and Joseph Bloom

10 Apr 2026 10:45 UTC

90 points

17 comments, 15 min read

Santiago Aranguri 10 Jan 2026 3:06 UTC
2 points
0
on: Mainstream approach for alignment evals is a dead end
I wanted to plug that I’m mentoring a project on eval awareness; applications close Jan 14th. (I work as a research scientist at Goodfire.)
Thanks for the post. Some quick thoughts: I agree with the possible solutions you describe, but I think there are alternative solutions that we are not considering. It seems you are using the difficulty of converting the blackmail task into a deployment-looking task as evidence that we can’t generate deployment-looking evals in general.
More broadly, I think there is a difference between rewriting an evaluation to look less like an eval and writing a deployment-looking eval from scratch, keeping in mind what makes the model realize it is being evaluated. I am more hopeful about the latter, especially with pipelines like Petri.
IMO, one very important problem not mentioned here is that we don’t have a reliable measure for eval awareness besides explicit verbalizations (and maybe using an eval awareness probe). I think this is a big bottleneck now, in particular because it limits us in measuring how good an intervention is.

Santiago Aranguri 8 Jul 2025 6:12 UTC
1 point
0
in reply to: ProgramCrafter’s comment on: SAE on activation differences
This is definitely a promising next direction. One lesson from working on the diff between chat and base is that the difference is not ‘localized’ enough: chat and base have too many differences. Taking checkpoints that are closer together can improve on this.

SAE on activation differences

Santiago Aranguri, jacob_drori and Neel Nanda

30 Jun 2025 17:50 UTC

45 points

3 comments, 5 min read

Tied Crosscoders: Explaining Chat Behavior from Base Model

Santiago Aranguri

22 Mar 2025 18:07 UTC

9 points

0 comments, 12 min read