Neel Nanda comments on Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

Neel Nanda 31 Oct 2025 0:04 UTC
7 points
−3
Why would that happen?

If you’re not training against the chain of thought, verbalizing eval awareness should be neutral with respect to reward. So environments that don’t incentivise verbalized eval awareness also shouldn’t incentivise non-verbalized eval awareness.
- Fabien Roger 31 Oct 2025 3:35 UTC
  12 points
  7
  An intuition here is that different RL environments could ~randomly change the “scratchpad personality” through indirect effects, with some changes pushing towards more blabbing and others towards less. So if you search over a large space of safety-training environments and pick all the environments that don’t result in verbalized situational awareness, you might be selecting all the environments that push towards a less blabby scratchpad personality.
  (In the specific case of Sonnet 4.5, the indirect optimization effects of removing the RL envs the model card describes are probably very weak, so I think this should not be the primary consideration absent some strong reason to expect these particular RL envs to encourage a very blabby personality.)
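
The selection effect described in the reply above can be illustrated with a toy simulation. This is only a sketch under loud assumptions that do not come from the thread or the model card: each hypothetical environment shifts "actual eval awareness" and "scratchpad blabbiness" by independent standard normal amounts, and the change in verbalized awareness is simply their sum. The function name `simulate` and all parameters are made up for illustration.

```python
import random
from statistics import mean


def simulate(n_envs=10_000, seed=0):
    """Toy model of selecting environments by verbalized eval awareness.

    Assumption (hypothetical, not from the source): each environment has
    independent N(0, 1) effects on actual awareness and on blabbiness,
    and verbalized awareness changes by the sum of the two.
    """
    rng = random.Random(seed)
    envs = []
    for _ in range(n_envs):
        awareness_shift = rng.gauss(0, 1)    # effect on actual eval awareness
        blabbiness_shift = rng.gauss(0, 1)   # indirect effect on scratchpad blabbiness
        verbalized_shift = awareness_shift + blabbiness_shift
        envs.append((awareness_shift, blabbiness_shift, verbalized_shift))

    # Keep only environments that do not increase verbalized awareness.
    kept = [e for e in envs if e[2] <= 0]

    return {
        "all_blabbiness": mean(e[1] for e in envs),
        "kept_blabbiness": mean(e[1] for e in kept),
        "kept_awareness": mean(e[0] for e in kept),
        "kept_fraction": len(kept) / n_envs,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:+.3f}")
```

In this toy model the kept environments have a clearly negative average blabbiness shift even though nothing optimized against the chain of thought directly; the filter alone does the work. How strong the effect is in practice depends on how many environments are filtered and how large their indirect effects are, which is the caveat raised for Sonnet 4.5.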