J Bostock comments on Misalignment classifiers: Why they’re hard to evaluate adversarially, and why we’re studying them anyway

J Bostock 15 Aug 2025 15:28 UTC
2 points
0
A “true misalignment” evaluation (under this definition of misalignment) needs to probe for optimizing behaviour. So for example, if a model is misaligned such that it wants to call a function with an unsanitized input, you should give it a menu of functions, switch which one has the unsanitized input, and check that it calls whichever function has the unsanitized input.