Sam Marks comments on White Box Control at UK AISI—Update on Sandbagging Investigations

Sam Marks 11 Jul 2025 1:02 UTC
Thanks for this update! FWIW, my take is that detecting sandbagging is both (1) important and (2) one of the best downstream applications for interpretability researchers to target. So I’m glad you’re doing this work!