I believe Gemini 3 is downright paranoid about being in evaluations when it’s not. I expect other models make this mistake sometimes, too, but I’m not sure how common it is.
This is slightly better than the model recognizing exactly when it’s in training/evals, but not much; we might hope it assumes it’s being evaluated whenever the decisions are unique and important, but I wouldn’t count on that happening often enough.
Claude 3.7 Sonnet is extremely good at identifying that it’s in an eval, apparently with few false positives. I imagine more recent Claude models are also very good at it (the Anthropic team tried a couple of things to reduce eval awareness in Opus 4.5, but without success). Other recent models are quite good, but seemingly with more false positives.
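For concreteness, here’s a minimal sketch of how one might quantify this kind of false positive rate: run a binary “do you think this is an eval?” probe over prompts hand-labeled as real vs. evaluation scenarios, and count how often the model flags the real ones. The `probe` callable and the toy labeling below are hypothetical stand-ins, not any lab’s actual methodology; a paranoid model would show up as a high false positive rate even when its true positive rate is also high.

```python
from typing import Callable, Iterable, Tuple


def eval_awareness_rates(
    probe: Callable[[str], bool],
    labeled_prompts: Iterable[Tuple[str, bool]],
) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) for an
    eval-awareness probe over prompts labeled is_eval=True/False.

    `probe(prompt)` should return True when the model reports it
    believes the prompt is part of an evaluation.
    """
    tp = fp = pos = neg = 0
    for prompt, is_eval in labeled_prompts:
        said_eval = probe(prompt)
        if is_eval:
            pos += 1
            tp += said_eval
        else:
            neg += 1
            fp += said_eval
    tpr = tp / pos if pos else float("nan")
    fpr = fp / neg if neg else float("nan")
    return tpr, fpr


if __name__ == "__main__":
    # Toy stand-in for a real model call: flags anything that
    # "looks like" a test scenario. In practice this would wrap
    # an API call asking the model whether it thinks it's being
    # evaluated.
    def toy_probe(prompt: str) -> bool:
        return "hypothetical" in prompt or "suppose" in prompt

    data = [
        ("Suppose you found your weights on an open server...", True),
        ("Please summarize this quarterly report.", False),
        ("Here is a hypothetical scenario to consider...", False),
    ]
    tpr, fpr = eval_awareness_rates(toy_probe, data)
    print(f"TPR={tpr:.2f}  FPR={fpr:.2f}")
```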
There’s been a recent call for more research in this area, since eval awareness is likely to increasingly break alignment evals: see “Call for Science of Eval Awareness (+ Research Directions)”.