ryan_greenblatt comments on AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

ryan_greenblatt 11 Mar 2026 17:03 UTC
7 points
Strong +1 to this, but I’d note that we should distinguish between the entire classes of black-box and white-box/top-down-interp-for-auditing methods and our current best methods. These results indicate that our current best white-box/top-down-interp-for-auditing methods don’t really beat black-box methods (for auditing and probably more generally), but these classes of methods (especially white-box methods) are extremely broad, and I expect there exist much better top-down interp and black-box methods. I intuitively think we should mostly update against these particular methods and against us having made much progress on top-down interp, and update less toward thinking there aren’t good but quite different methods that can be found. (I don’t think we’ve explored that much of the space despite putting a decent amount of effort into some particular directions.) TBC, this is still an update against further attempts at finding new white-box methods (insofar as we think we’ve done a bunch of work of that rough type in the past and not gotten real results).

(I think you likely agree with this, but I still thought this point was worth emphasizing.)