Zach Stein-Perlman comments on Zach Stein-Perlman’s Shortform

Zach Stein-Perlman 17 Jul 2025 19:46 UTC
LW: 11 AF: 5
I want to distinguish (1) finding undesired behaviors or goals from (2) catching actual attempts to subvert safety techniques or attack the company. I claim the posts you cite are about (2). I agree with those posts that (2) would be very helpful. I don’t think that’s what alignment auditing work is aiming at.^[1] (And I think lower-hanging fruit for (2) is improving monitoring during deployment plus some behavioral testing in (fake) high-stakes situations.)
1. ^
  The AI “brain scan” hope definitely isn’t like this.
  I don’t think the alignment auditing paper is like this, but related things could be.