Mads U

Karma: 0

Mads U 16 Nov 2025 17:36 UTC
1 point
on: Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals
Does this mean the model will always behave nicely if it always thinks it is being tested?