Marius Hobbhahn comments on Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Marius Hobbhahn 2 Jun 2025 18:42 UTC
2 points
In one of my MATS projects, we found that some models have a bias toward thinking they're always being evaluated, including in real-world scenarios. The paper isn't public yet. But it seems like a pretty brittle belief that the models don't hold very strongly. I think this can be part of a strategy, but it should never be load-bearing.