plex comments on Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

plex 1 Nov 2025 10:37 UTC
7 points
0
Prediction: future models that are not themselves trained on alignment evals will still have greater eval awareness than they would have had if this model had not been trained on alignment evals, or if this model’s outputs were reliably filtered out of their training data, because they will pick up patterns from eval-trained models through the training data. (Though likely still less awareness than this model.)