Thanks for flagging, it’s interesting to know that this idea has considerable traction. Maybe we should have responded in more detail given this.
Our response was mainly the below:
If models can’t confidently distinguish test and deployment, they may be super paranoid about being tested and behave well out of caution in deployment. This is still bad news, since it suggests that more capable models might behave dangerously in deployment, but doesn’t provide clear warning signs or examples of misbehavior in the wild to study.
Also, this doesn’t seem like a scalable solution for more capable models, which will probably be able to distinguish test from deployment at some point.