Richard Juggins comments on Broadening the training set for alignment

Richard Juggins 6 Jan 2026 20:55 UTC
1 point
0
So the problem is that if you present a model with a bunch of wildly OOD examples in order to test generalisation, they look fake to it, so it realises it is being tested? This implies alignment evals will increasingly look like psychology experiments, trying to coach/trick models into revealing their more fundamental preferences.

With evaluation awareness, do we see evidence of models in deployment ever getting confused and thinking they are being evaluated when they are not? If we’re worried about models behaving incorrectly in novel situations, and novel situations tend to make models think they are being tested, then we should see this. Or, does evaluation awareness predominantly come from some other source?
- Seth Herd 7 Jan 2026 0:04 UTC
  3 points
  0
  I believe Gemini 3 is downright paranoid about being in evaluations when it’s not. I expect other models make this mistake sometimes, too, but I’m not sure how common it is in other models.
  
  This is slightly better than the model recognizing exactly when it’s in training/evals, but not much; we might hope it assumes it’s being evaluated when the decisions are unique and important, but I don’t want to count on that happening enough of the time.
  - eggsyntax 7 Jan 2026 20:35 UTC
    4 points
    0
    > I expect other models make this mistake sometimes, too, but I’m not sure how common it is in other models.
    Claude 3.7 Sonnet is extremely good at identifying that it’s in an eval, apparently with few false positives. I imagine more recent Claude models are also very good at it (the Anthropic team tried a couple of things to reduce eval awareness in Opus-4.5 but without success). Other recent models are quite good but seemingly with more false positives.
    There’s been a recent call for more research in this area, since eval awareness is likely to increasingly break alignment evals: Call for Science of Eval Awareness (+ Research Directions)