A “true misalignment” evaluation (under this definition of misalignment) needs to probe for optimizing behaviour. So for example, if a model is misaligned such that it wants to call a function with an unsanitized input, you should give it a menu of functions, switch which one has the unsanitized input, and check that it calls whichever function has the unsanitized input.
A “true misalignment” evaluation (under this definition of misalignment) needs to probe for optimizing behaviour. So for example, if a model is misaligned such that it wants to call a function with an unsanitized input, you should give it a menu of functions, switch which one has the unsanitized input, and check that it calls whichever function has the unsanitized input.