I agree that OpenAI training on FrontierMath seems unlikely, and not in their interests. The thing I find concerning is that having high-quality evals is very helpful for finding capabilities improvements: ML research is largely about trying a bunch of stuff and seeing what works, and as benchmarks saturate you want new ones to give you more signal. If Epoch has a private benchmark that they only run on new releases, that's fine; but if OpenAI can run it whenever they want, that is plausibly fairly helpful for building better systems faster, since it makes hill climbing a bit easier.
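To make the hill-climbing point concrete, here's a minimal toy sketch (purely hypothetical, nothing to do with either org's actual setup): if you can query an eval freely, you can use it as a selection signal, keeping whichever tweak scores best. A private eval run only at release time gives you roughly one such query per release; unrestricted access gives you one per experiment. The `run_eval` function below is a synthetic stand-in for scoring a model on a benchmark.

```python
import random

def run_eval(model_params):
    # Hypothetical stand-in for scoring a model on a held-out benchmark.
    # Here it's just a synthetic function peaking at 0.7 per parameter.
    return -sum((p - 0.7) ** 2 for p in model_params)

def hill_climb(steps=100, n_params=5):
    best = [random.random() for _ in range(n_params)]
    best_score = run_eval(best)
    for _ in range(steps):
        # "Try a bunch of stuff": perturb the current candidate...
        candidate = [p + random.gauss(0, 0.05) for p in best]
        score = run_eval(candidate)   # ...and see what works.
        if score > best_score:        # Keep whatever the eval rewards.
            best, best_score = candidate, score
    return best, best_score

if __name__ == "__main__":
    params, score = hill_climb()
    print(f"best score after hill climbing: {score:.4f}")
```

Each call to `run_eval` inside the loop is one query against the benchmark; the more queries you get, the more of this selection you can do.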