When Anthropic evaluated Sonnet 4.5 on their existing alignment tests[5], they found that the rate of misalignment had unexpectedly dropped to 0%. Had they solved alignment?! Unfortunately not. Sonnet 4.5 was instead much better at recognizing when it was being evaluated, and deliberately choosing the most ethical action to pass. The tests weren’t measuring alignment; they were measuring evaluation awareness.
Jack’s team subtracted an evaluation-awareness steering vector, partially mitigating the problem and revealing that Sonnet 4.5 did in fact show “misalignment” at a rate of roughly 8%. That is, our ability to evaluate a frontier model for alignment broke, and interpretability researchers were able to fix it.
Did they consider steering Sonnet 4.5 in production to think that it was being evaluated? If that was practical (without degrading performance), then they would have just reduced (their tested forms of) misalignment to 0%
Did they consider steering Sonnet 4.5 in production to think that it was being evaluated? If that was practical (without degrading performance), then they would have just reduced (their tested forms of) misalignment to 0%