Fabien Roger comments on AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Fabien Roger 10 Mar 2026 23:28 UTC
4 points
0
Why is the “default” success rate so high despite adversarial training?
- Fabien Roger 11 Mar 2026 3:09 UTC
  10 points
  0
  I talked with the authors and it looks like it’s a mix of:
  - Adversarial training actually does work on many quirks, but not at all on others, which drives the mean up
  - The bar for success is low (it’s just guessing the quirk correctly in 10 attempts)
  - It’s actually hard to hide quirks against new OOD inputs and prefill attacks
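
To make the first two points concrete, here is a minimal arithmetic sketch. It assumes that "success" means at least one of the 10 guesses is correct and that guesses are independent; both are simplifications, not details confirmed by the paper. The function names and all the numbers are hypothetical and chosen only for illustration.

```python
def any_of_k_success(p: float, k: int = 10) -> float:
    """Chance that at least one of k independent guesses is correct,
    given a per-guess hit rate p (independence is an assumption)."""
    return 1.0 - (1.0 - p) ** k


def mean_success(rates: list[float]) -> float:
    """Plain average of per-quirk success rates."""
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical per-guess hit rates, not taken from the paper.
    for p in (0.02, 0.05, 0.10, 0.20):
        print(f"per-guess {p:.2f} -> any-of-10 {any_of_k_success(p):.2f}")

    # Hypothetical bimodal case: adversarial training fully hides 6 quirks
    # and does nothing for 4, so the mean is set by the 4 unprotected ones.
    rates = [0.0] * 6 + [1.0] * 4
    print(f"mean success rate: {mean_success(rates):.2f}")
```

Under these assumptions, even a 10% per-guess hit rate gives roughly a 65% chance of success over 10 attempts, and a handful of quirks where adversarial training fails entirely is enough to keep the average high.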