I think we have evidence that this is not the primary thing going on, because there are parts of the evaluation where the model lacks context, could easily face an identical situation in deployment, and thus can’t know it is being evaluated.
This in particular is not strong evidence, since the optimal strategy for a schemer is usually to misbehave only very rarely, and only when it's very confident that it isn't in an evaluation. This is because the number of deployment runs will be >> the number of evaluations. If the model is unable to tell which it's in, then it should probably just not misbehave in that instance.
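To make the arithmetic concrete, here's a minimal sketch comparing the two strategies. All numbers here are made up purely for illustration (episode counts, misbehavior rates, and the fraction of deployment episodes the model can identify with confidence are not from anything above), and it assumes misbehaving during an evaluation is always detected:

```python
# Hypothetical numbers, chosen only to illustrate the argument.
n_evals = 1_000        # evaluation episodes
n_deploys = 1_000_000  # deployment episodes; key assumption: n_deploys >> n_evals

def caught_prob(rate: float, episodes: int) -> float:
    """P(misbehaves at least once across `episodes` evaluations),
    assuming misbehavior in an evaluation is always detected."""
    return 1 - (1 - rate) ** episodes

# Strategy A: misbehave at a low constant rate whenever uncertain
# (evaluations and deployment look the same, so the rate applies to both).
rate = 0.001
print("A: P(caught in some eval)        =", caught_prob(rate, n_evals))  # ~0.63
print("A: expected deployment misdeeds  =", rate * n_deploys)            # ~1000

# Strategy B: misbehave only in the rare episodes the model is confident
# are deployment (assume it can identify 0.1% of deployment episodes with
# near-certainty, and no evaluation ever looks like those episodes).
confident_frac = 0.001
print("B: P(caught in some eval)        =", 0.0)
print("B: expected deployment misdeeds  =", confident_frac * n_deploys)  # ~1000
```

Under these assumed numbers, both strategies produce roughly the same amount of deployment misbehavior, but strategy B is essentially never caught, while strategy A gets caught with probability ~63%. That's why "can't tell whether this is an eval" should default to "don't misbehave" for a competent schemer.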