ryan_greenblatt comments on Sonnet 4.5′s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

ryan_greenblatt 31 Oct 2025 16:50 UTC
8 points
4
This seems non-crazy, but my current guess would be that the training for 4.5 (relative to 4) made both verbalized and unverbalized eval awareness more likely. E.g., I don’t have much confidence that 4.5 isn’t aware it is in an eval even if it doesn’t mention this in the cot (especially if it sometimes mentions this for pretty similar inputs or sometimes mentions this for this input). Like, it’s good that we know eval awareness is an issue in 4.5, but it’s hard to be confident you’ve avoided this issue in some eval by looking for verbalized eval awareness!

(If we somehow became confident that 4.5 is eval aware if it verbalizes with some experiments/science that could change my perspective.)

If you aggressively optimize for finding training environments that train against misalignment and which don’t increase verbalized eval awareness, it does seem plausible you find environments that make unverbalized eval awareness more likely. I’d propose also not doing this and instead trying to roughly understand what makes eval awareness increase and avoid this in general.
- Bronson Schoen 2 Nov 2025 21:20 UTC
  1 point
  0
  Parent
  I’d propose also not doing this and instead trying to roughly understand what makes eval awareness increase and avoid this in general.
  
  I also agree with this approach, my worry is the incentive structures here are:
  - It’s always going to be cheap to generate synthetic environments like this
  - It seems like currently (and my guess going forward, but uncertain here) is that the ability of generation N models to generate environments “realistic enough” for generation N+1 models is always going to be lagging behind where we’d need it to be
  - The cheap generated synthetic environments will make your model seem more aligned while making misaligned actions harder and harder to spot
  I’d be a bit more optimistic about some training intervention that reduced the salience of the features contributing to unverbalized evaluation awareness rather than trying to get labs to avoid training on the cheap synthetic scenarios in the first place. The first step to figuring out a training intervention is to understand the cause of the increase in detail anyway though, so more just writing this out in case you already had ideas or intuitions here.