I’m no longer at GDM but I am confident they are not training on evals. They took that super seriously when I was there and I have no reason to think that would have changed.
What they don’t do is filter out every web page that has the canary string. Since people put canary strings on random web pages (like this one), which was not their intended use, those pages get into the training data.
Also, they do directly hill climb on high-profile metrics, which means those metrics don’t necessarily measure what they were intended to measure when they were created.
(Super interesting documentation of all that crazy CoT going on.)
It sounds like they don’t filter the canary string out of general web text, even though doing so would be cheap and would remove only a really tiny proportion of the training data. Maybe that tiny proportion yields a disproportionate boost in performance. Hmm, wonder why that could be.
Well of course it yields a disproportionate boost, it contains information about the benchmarks, even if it doesn’t contain the exact questions. Unless that was intended subtext and you were being sarcastic?
Yep, sarcastic. Sorry, someday I’ll learn not to do that on the Internet, but I’m not holding my breath.
In what sense is this page about Gemini benchmarking and LLM canary strings a ‘random web page’ to a Gemini LLM?
I mean, it doesn’t contain eval data and isn’t part of the eval set that the canary string is from. So the canary string is not serving its intended purpose of marking evals that you should not include in training data.
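For anyone unfamiliar with why this filtering would be cheap: canary-string filtering is just a substring check against a published GUID. A minimal sketch in Python (the GUID below is a made-up placeholder, not the real BIG-bench canary):

```python
# Minimal sketch of canary-string filtering: drop any document that
# contains a known canary GUID. The GUID here is a placeholder.
CANARY_GUIDS = {
    "00000000-0000-0000-0000-000000000000",  # placeholder, not a real canary
}

def passes_canary_filter(document: str) -> bool:
    """Return True if the document contains no known canary string."""
    return not any(guid in document for guid in CANARY_GUIDS)

docs = [
    "An ordinary blog post about cooking.",
    "Benchmark page. canary GUID 00000000-0000-0000-0000-000000000000",
]
kept = [d for d in docs if passes_canary_filter(d)]
```

The point of the check is exactly that it is blunt: it drops the whole page whenever the marker appears, regardless of whether the page actually contains eval questions, which is both its cheapness and its false-positive problem.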
As others have mentioned, this seems kinda crazy and bad. I was surprised you didn’t think this.
“Unrelated” question, but are you under a non-disparagement agreement with GDM that would prevent you from criticizing things like their data-filtering practices?
I am not under nondisparagement agreements from anyone and feel free to criticize GDM. I do still have friends there, of course. I certainly wouldn’t be correcting misapprehensions about GDM if I didn’t believe what I was saying!
Thanks!
I mean, aren’t you training on evals simply by pretraining on random blogposts that have the eval text?
Like, are there more sophisticated algorithms than “watch for the canary string” that do keep the evals out of the training set?
If not, I think the right way to describe this is to say that the models are trained on the evals. Like, sure, maybe they aren’t setting up a full RL environment for all of them, but there is still of course a huge effect size from that.
Yeah—without going into too much detail, what they actually do is look for text from the evals and filter that out. This is way more reliable than looking for the canary string, because there are tons of cases of people talking about specific eval examples without including the canary string. So you really have to do something more sophisticated than just look for the canary string.
So they filter out posts with eval text, which is what you really want.
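The "look for text from the evals" approach described above is often implemented in published decontamination pipelines as n-gram overlap matching: index the n-grams of the eval questions, then flag any training document that shares an n-gram with that index. This is a generic sketch of that common technique, not a claim about GDM's actual pipeline, and real pipelines use fuzzier matching and thresholds:

```python
# Hedged sketch of n-gram-overlap decontamination (a common published
# technique; not a description of any specific lab's pipeline).
def ngrams(text: str, n: int = 8) -> set:
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document: str, eval_index: set, n: int = 8) -> bool:
    """Flag a document if it shares any n-gram with the eval set."""
    return bool(ngrams(document, n) & eval_index)

# Hypothetical eval question, used only to build the index.
eval_questions = [
    "What is the smallest prime number greater than one hundred?",
]
eval_index = set()
for q in eval_questions:
    eval_index |= ngrams(q, n=8)

# A blog post that quotes the eval item without any canary string.
blog_post = ("A quiz site reposted the item: what is the smallest prime "
             "number greater than one hundred? Answer below.")
```

Note that `is_contaminated(blog_post, eval_index)` catches this page even though it carries no canary string, which matches the point made above: content matching is more reliable than marker matching.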
That makes sense! I do think this sounds like a pretty tricky problem, as e.g. a machine translation of an eval might very well make it into the training set, or plots or charts that de facto encode the evals themselves.
But it does sound like there is some substantial effort going into preventing at least the worst error modes here. Thank you for the clarification!
Thanks for answering so many questions about this. I can see why it makes sense to filter on text from the evals. What’s the rationale for not also filtering on the canary string as a precaution? I realize there would be some false positives due to abuse, but is that common enough that it would have a significant inappropriate effect?
I think of the canary string as being useful because it communicates that some researcher has judged the document as likely to corrupt eval / benchmark results. Searching for specific text from evals doesn’t seem like a full substitute for that judgment.
To be clear, I’m not asking you to justify or defend the decision; I just would like to better understand GDM’s thinking here.
I think it is really important not to train on things that talk about benchmarks or scheming or whatnot, and there is a lot of leakage, especially about safety evaluations, even when it doesn’t contain the actual evaluation content. People seem to broadly be using canary strings responsibly, to the best of my knowledge, and it is disappointing that they are not filtered out well.
I’m confused how “they do directly hill climb on high profile metrics” is compatible with any of this, since that seems to imply that they do in fact train on benchmarks, which is the exact thing you just said was false?
Edit: maybe you mean the thing where they optimize for LMArena but don’t technically train on it?
I assume it means they use the benchmark as the test set, not the training set.
Yes this.