Secondly, OpenAI had complete access to the problems and to solutions for most of them. This means they could, in principle, have trained their models on the benchmark. However, they verbally agreed not to do so, and frankly I don’t think they would have done that anyway, simply because this is too valuable a dataset to burn by memorizing it.
Now, nobody really knows what goes on behind o3, but if it follows the kind of inference-scaling, search-over-“thinking” approach published by other frontier labs (possibly advanced chain-of-thought and introspection, combined with MCMC-style rollouts over the output distribution scored by a PRM-style verifier), then FrontierMath is a golden opportunity to validate on.
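To make that speculation concrete, here is a minimal sketch of what such an inference-time search loop could look like: sample several candidate reasoning chains, score them with a process reward model, and keep the best one. The names `generate_step` and `prm_score` are hypothetical stand-ins; nothing here is confirmed about how o3 actually works.

```python
import random

# Hedged sketch of best-of-N inference-time search with a PRM-style verifier.
# Both model calls are stubbed; this is an illustration, not a claim about o3.

def generate_step(prompt: str, chain: list[str]) -> str:
    """Sample one more reasoning step from the base model (stubbed here)."""
    return f"step-{len(chain) + 1}-{random.randint(0, 9)}"

def prm_score(prompt: str, chain: list[str]) -> float:
    """Process reward model: score a full reasoning chain in [0, 1] (stubbed here)."""
    return random.random()

def search(prompt: str, n_rollouts: int = 8, max_steps: int = 16):
    """Roll out n candidate chains step by step, keep the one the PRM scores highest."""
    best_chain, best_score = None, float("-inf")
    for _ in range(n_rollouts):
        chain = []
        for _ in range(max_steps):
            chain.append(generate_step(prompt, chain))
        score = prm_score(prompt, chain)
        if score > best_score:
            best_chain, best_score = chain, score
    return best_chain, best_score

if __name__ == "__main__":
    chain, score = search("Prove that ...")
    print(f"best score: {score:.3f}, final step: {chain[-1]}")
```

The point of the sketch is just that any setup like this needs a trusted held-out set to tune rollout counts, verifier thresholds, and so on, which is exactly where a dataset like FrontierMath becomes valuable.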
If you think they didn’t train on FrontierMath answers, why do you think having the opportunity to validate on it is such a significant advantage for OpenAI?
Couldn’t they just make a validation set from their training set anyway?
In short, I don’t think the capabilities externalities of a “good validation dataset” are that big, especially not counterfactually—sure, maybe it would have taken OpenAI a bit more time to contract some mathematicians, but realistically, how much more time?
Whereas if your ToC as Epoch is “make good forecasts on AI progress”, it makes sense that you want labs to report results on the dataset you’ve put together.
Sure, maybe you could commit to not releasing the dataset and only testing models in-house, but maybe you think you don’t have the capacity in-house to elicit maximum capability from the models. (Solving the ARC challenge cost O($400k) for OpenAI; that’s peanuts for them, but it’s roughly 2-3 researcher salaries at Epoch, right?)
If I were Epoch, I would be worried about “cheating” on the results (dataset leakage).
Re: the unclear dataset split: yeah, that was pretty annoying, but that’s partly on OpenAI comms too.
I teeend to agree that orgs claiming to be safety orgs shouldn’t sign NDAs preventing them from disclosing their lab partners (or even the details of those partnerships), but this might be a tough call to make in reality.
I definitely don’t see a problem with taking lab funding as a safety org. (As long as you don’t claim otherwise.)
That’s why it’s called… *scalable* alignment? :D
Somewhat tongue-in-cheek, but I think I am sort of confused about what the core news here actually is.