In the rest of this post, we examine the evidence behind each of these claims, estimate how much of Sonnet 4.5’s improvements might have been caused by eval gaming, and respond to arguments like: “Isn’t it fine so long as the model always thinks it is in an evaluation in high-stakes contexts?”.
I might’ve missed it, but what section has the response here? FWIW I very strongly agree with this being a bad idea, but have heard it suggested many times independently from a wide variety of people, to the point I’m worried it’ll end up eliciting a lot of work having to “disprove” that it would work.
Thanks for flagging, it’s interesting to know that this idea has considerable traction. Maybe we should have responded in more detail given that.
Our response was mainly the following:
If models can’t confidently distinguish testing from deployment, they may be highly paranoid about being tested and behave well in deployment out of caution. This is still bad news, since it suggests that more capable models might behave dangerously in deployment, but it doesn’t provide clear warning signs or examples of misbehavior in the wild to study.
Also, this doesn’t seem like a scalable solution for more capable models, which will probably be able to distinguish test from deployment at some point.