I work at GDM so obviously take that into account here, but in my internal conversations about external benchmarks we take cheating very seriously—we don’t want eval data to leak into training data, and have multiple lines of defense to keep that from happening.
What do you mean by “we”? Do you work on the pretraining team, talk directly with the pretraining team, are just aware of the methods the pretraining team uses, or some other thing?
I don’t work directly on pretraining, but when there were allegations of eval set contamination due to detection of a canary string last year, I looked into it specifically. I read the docs on prevention, talked with the lead engineer, and discussed with other execs.
So I have pretty detailed knowledge here. Of course GDM is a big complicated place and I certainly don’t know everything, but I’m confident that we are trying hard to prevent contamination.
What do you mean by “we”? Do you work on the pretraining team, talk directly with the pretraining team, are just aware of the methods the pretraining team uses, or some other thing?
I don’t work directly on pretraining, but when there were allegations of eval set contamination due to detection of a canary string last year, I looked into it specifically. I read the docs on prevention, talked with the lead engineer, and discussed with other execs.
So I have pretty detailed knowledge here. Of course GDM is a big complicated place and I certainly don’t know everything, but I’m confident that we are trying hard to prevent contamination.