Aprillion comments on Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

Aprillion 31 Oct 2025 10:10 UTC
3 points
2
I am probably missing something, but when talking about “accuracy”, how did you measure true and false negatives (thinking and not thinking about evals when not in an eval)?