Alice Blair comments on Gemini 3 is Evaluation-Paranoid and Contaminated

Alice Blair 20 Nov 2025 21:51 UTC
8 points
1
I think it is really important to not train on things that talk about benchmarks or scheming or whatnot, and there is a lot of leakage, especially about safety evaluations, even when the text doesn’t contain the actual evaluation content. People seem to broadly be using canary strings responsibly, to the best of my knowledge, and it is disappointing that labs do not filter canaried documents out well.
I’m confused how “they do directly hill climb on high profile metrics” is compatible with any of this, since that seems to imply that they do in fact train on benchmarks, which is the exact thing you just said was false?
Edit: maybe you mean the thing where they optimize for LMArena but don’t technically train on it?
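For concreteness, here is a minimal sketch of what filtering on canary strings could look like. The marker below is a made-up placeholder, not any real benchmark's canary, and `filter_canaried` is a hypothetical helper rather than anything a lab is known to use.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical canary marker, for illustration only.
# Real benchmarks publish their own canary strings.
CANARY_MARKERS: Tuple[str, ...] = ("EXAMPLE-CANARY-GUID-0000",)


def filter_canaried(
    docs: Iterable[str],
    markers: Tuple[str, ...] = CANARY_MARKERS,
) -> Iterator[str]:
    """Yield only the documents that contain none of the canary markers."""
    for doc in docs:
        if not any(marker in doc for marker in markers):
            yield doc


corpus = [
    "An ordinary web page about cooking.",
    "Benchmark transcript. canary EXAMPLE-CANARY-GUID-0000",
]
kept = list(filter_canaried(corpus))
```

A plain substring check like this only catches documents that carry the canary itself. It would not catch the kind of leakage mentioned above, where a page discusses an evaluation without including its contents or its canary.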
- Thane Ruthenis 21 Nov 2025 1:31 UTC
  7 points
  2
  I’m confused how “they do directly hill climb on high profile metrics” is compatible with any of this, since that seems to imply that they do in fact train on benchmarks, which is the exact thing you just said was false?
  I assume it means they use the benchmark as the test set, not the training set.
  - Dave Orr 21 Nov 2025 20:39 UTC
    5 points
    0
    Yes this.