Fabien Roger comments on AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Fabien Roger 11 Mar 2026 3:33 UTC
36 points
2
Despite being somewhat pessimistic about the downstream performance of top-down interp, I find this result to be in the bottom 10th percentile of my expectations: adversarial training (against black-box attacks) reduces white-box technique performance as much as black-box technique performance, so white-box techniques mostly do not outperform black-box ones on this benchmark.
(I previously thought it would be slightly unfair to black-box techniques to adversarially train against black-box techniques without also finding some way to train against white-box techniques, but it looks like you don’t even need to try being fair in this way to have white-box techniques ~not outperform black-box techniques!)
I think this is a meaningful negative update against top-down-interp-for-auditing. I think it’s still relatively likely top-down interp helps detect coherent misalignment (on priors), but I don’t know of strong empirical evidence supporting this hope. (There are other stories for interp being useful, so it might not be a big update for most interp researchers.)
The paper does not convey this “big-negative update on interp” vibe though (e.g. none of the discussion items are about this big surprise), so maybe you disagree and I am missing some important elements in my interpretation?
- ryan_greenblatt 11 Mar 2026 17:03 UTC
  6 points
  4
  Strong +1 to this, but I’d note that we should distinguish between the entire classes of black-box and white-box/top-down-interp-for-auditing methods and our current best methods. These results indicate that our current best white-box/top-down-interp-for-auditing methods don’t really beat black-box methods (for auditing and probably more generally), but these classes of methods (especially white-box methods) are extremely broad and I expect there exist much better top-down interp and black-box methods. I intuitively think we should mostly update against these particular methods and against us having made much progress on top-down interp, and update less toward the view that there aren’t good but quite different methods to be found. (I don’t think we’ve explored that much of the space despite putting a decent amount of effort into some particular directions.) TBC, this is still an update against further attempts at finding new white-box methods (insofar as we think we’ve done a bunch of work of that rough type in the past and not gotten real results).
  
  (I think you likely agree with this, but I still thought this point was worth emphasizing.)