Why is the “default” success rate so high despite adversarial training?
I talked with the authors, and it looks like a mix of three things:
Adversarial training does work on many quirks but not at all on others, and the unaffected quirks drive the mean up.
The bar for success is low: a run counts as a success if the quirk is guessed correctly within 10 attempts.
Hiding quirks from new OOD inputs and prefill attacks is genuinely hard.
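To get a feel for how the first two points interact, here is a rough back-of-the-envelope sketch. It assumes the 10 attempts are independent and that each has the same chance of naming the quirk, which is my simplification and not something the authors stated. The per-attempt probabilities are made up for illustration, and `success_within_k` is a hypothetical helper name.

```python
def success_within_k(p: float, k: int = 10) -> float:
    """Chance of at least one correct guess in k independent attempts,
    where each attempt is correct with probability p (an assumption)."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical per-attempt probabilities for five quirks: three where
# adversarial training mostly worked, two where it did not work at all.
per_attempt = [0.02, 0.02, 0.02, 0.5, 0.5]

# Per-quirk success rate under the "correct within 10 attempts" bar.
rates = [success_within_k(p) for p in per_attempt]

# The mean is pulled up by the quirks that training failed to hide.
mean_rate = sum(rates) / len(rates)

for p, r in zip(per_attempt, rates):
    print(f"per-attempt {p:.2f} -> within 10 attempts {r:.3f}")
print(f"mean success rate: {mean_rate:.3f}")
```

Under these assumptions, even a per-attempt rate of 0.2 gives roughly 0.89 within 10 attempts, so a few poorly hidden quirks plus a lenient bar can keep the average high.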