andrq

Karma: 92

Steering Evaluation-Aware Models to Act Like They Are Deployed

30 Oct 2025 15:03 UTC
61 points
12 comments, 16 min read

Discovering Backdoor Triggers

19 Aug 2025 6:24 UTC
57 points
4 comments, 13 min read