Santiago Aranguri

Karma: 200

Logits as a new monitor for evaluation awareness

Santiago Aranguri · 4 Jun 2026 16:12 UTC
30 points
4 comments · 6 min read · LW link

Predicting Rare LLM Failures with 30× Fewer Rollouts

13 May 2026 17:53 UTC
55 points
3 comments · 5 min read · LW link

Verbalized Eval Awareness Inflates Measured Safety

4 May 2026 20:02 UTC
40 points
0 comments · 29 min read · LW link

Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training

29 Apr 2026 19:30 UTC
16 points
0 comments · 13 min read · LW link

Reproducing steering against evaluation awareness in a large open-weight model

10 Apr 2026 10:45 UTC
89 points
17 comments · 15 min read · LW link

SAE on activation differences

30 Jun 2025 17:50 UTC
45 points
3 comments · 5 min read · LW link

Tied Crosscoders: Explaining Chat Behavior from Base Model

Santiago Aranguri · 22 Mar 2025 18:07 UTC
9 points
0 comments · 12 min read · LW link