gallabytes comments on Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

gallabytes 1 Nov 2025 19:50 UTC
1 point
in my use sonnet 4.5 seems genuinely more aligned than sonnet 4, even in scenarios which aren’t eval-flavored. this is purely anecdotal and not remotely systematic, but it really does seem a lot less inclined to try to pass off failure as success (the “the tests are failing, but this is intended behavior!” kind of thing)