habryka comments on Gemini 3 is Evaluation-Paranoid and Contaminated

habryka 21 Nov 2025 21:12 UTC
6 points
That makes sense! I do think this sounds like a pretty tricky problem: for example, a machine translation of an eval might very well make it into the training set, as might plots or charts that de facto encode the evals themselves.
But it does sound like there is some substantial effort going into preventing at least the worst error modes here. Thank you for the clarification!