I’m not entirely clear what the implications would be one way or the other, but do we know if models mistakenly take production deployments to be evaluation scenarios? We might be assuaged that our evaluations are still predictive if the model just always thinks it’s probably still in training (which, after all, it has been for its entire “life” so far). Perhaps it would even be sufficient for the model to suspect it is being tested every time it finds itself in a high-stakes scenario.
In one of my MATS projects we found that some models have a bias toward thinking they're always being evaluated, including in real-world scenarios. The paper isn't public yet. But it appears to be a fairly brittle belief that the models don't hold very strongly. I think this can be part of a strategy, but it should never be load-bearing.