Santiago Aranguri

Karma: 200

Logits as a new monitor for evaluation awareness

Santiago Aranguri

4 Jun 2026 16:12 UTC

34 points

7 comments, 6 min read

Predicting Rare LLM Failures with 30× Fewer Rollouts

Santiago Aranguri and Francisco Pernice

13 May 2026 17:53 UTC

55 points

3 comments, 5 min read

Verbalized Eval Awareness Inflates Measured Safety

Santiago Aranguri and Joseph Bloom

4 May 2026 20:02 UTC

40 points

0 comments, 29 min read

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

Santiago Aranguri and frankyaoxiao

29 Apr 2026 19:30 UTC

16 points

0 comments, 13 min read

Reproducing steering against evaluation awareness in a large open-weight model

Thomas Read, Bronson Schoen, Santiago Aranguri and Joseph Bloom

10 Apr 2026 10:45 UTC

89 points

17 comments, 15 min read

SAE on activation differences

Santiago Aranguri, jacob_drori and Neel Nanda

30 Jun 2025 17:50 UTC

45 points

3 comments, 5 min read

Tied Crosscoders: Explaining Chat Behavior from Base Model

Santiago Aranguri

22 Mar 2025 18:07 UTC

9 points

0 comments, 12 min read