Secondly, OpenAI had complete access to the problems and to the solutions of most of them. This means they could actually have trained their models on it. However, they verbally agreed not to do so, and frankly I don’t think they would have anyway, simply because this is too valuable a dataset to memorize.
Now, nobody really knows what goes on behind o3, but if they follow the kind of inference-scaling, search-space-expanding “thinking” approach published by other frontier labs, which possibly combines advanced chain-of-thought and introspection with MCMC-style rollouts over the output distribution and a PRM-style verifier, then FrontierMath is a golden opportunity to validate on.
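To make the “rollouts plus PRM-style verifier” idea concrete, here is a minimal sketch of best-of-N selection: sample several candidate reasoning traces, score each step with a process reward model, and keep the highest-scoring trace. Everything here is illustrative — `mock_policy_rollout` and `mock_prm_score` are stand-ins, not anyone’s actual pipeline.

```python
import random

def mock_policy_rollout(prompt, rng):
    """Stand-in for sampling a chain-of-thought trace from a model."""
    return [f"step {i}: work on '{prompt}'" for i in range(rng.randint(2, 5))]

def mock_prm_score(step):
    """Stand-in for a process reward model scoring a single step in [0, 1]."""
    return random.Random(step).random()  # deterministic per step text

def best_of_n(prompt, n=8, seed=0):
    """Sample n rollouts, score each by mean per-step PRM score, return the best."""
    rng = random.Random(seed)
    rollouts = [mock_policy_rollout(prompt, rng) for _ in range(n)]
    scored = [(sum(map(mock_prm_score, r)) / len(r), r) for r in rollouts]
    return max(scored, key=lambda t: t[0])

score, best = best_of_n("integrate x^2")
print(score, len(best))
```

The point of having a hard, uncontaminated benchmark in this setup is that it lets you tune things like `n` or the PRM aggregation rule against problems the model provably hasn’t memorized.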
If you think they didn’t train on FrontierMath answers, why do you think having the opportunity to validate on it is such a significant advantage for OpenAI?
Couldn’t they just make a validation set from their training set anyway?
In short, I don’t think the capabilities externalities of a “good validation dataset” are that big, especially not counterfactually—sure, maybe it would have taken OpenAI a bit more time to contract some mathematicians, but realistically, how much more time?
Whereas if your ToC as Epoch is “make good forecasts on AI progress”, it makes sense that you want labs to report results on the dataset you’ve put together.
Sure, maybe you could commit to not releasing the dataset and only testing models in-house, but maybe you think you don’t have the capacity in-house to elicit maximum capability from models. (Solving the ARC challenge cost O($400k) for OpenAI; that is peanuts for them but like 2-3 researcher salaries at Epoch, right?)
If I were Epoch, I would be worried about “cheating” on the results (dataset leakage).
Re: the unclear dataset split: yeah, that was pretty annoying, but that’s also partly on OpenAI comms.
I tend to agree that orgs claiming to be safety orgs shouldn’t sign NDAs preventing them from disclosing their lab partners, or even details of the partnerships, but this might be a tough call to make in reality.
I definitely don’t see a problem with taking lab funding as a safety org. (As long as you don’t claim otherwise.)
I definitely don’t have a problem with this either—just that it needs to be much more transparent and carefully thought-out than how it happened here.
If you think they didn’t train on FrontierMath answers, why do you think having the opportunity to validate on it is such a significant advantage for OpenAI?
My concern is that “verbally agreeing to not use it for training” leaves a lot of opportunities to still use it as a significant advantage. For instance, do we know that they did not use it indirectly to validate a PRM that could in turn help a lot? I don’t think making a validation set out of their training data would be as effective.
Re: “maybe it would have taken OpenAI a bit more time to contract some mathematicians, but realistically, how much more time?”: Not much, and they might have done this independently as well. (Assuming the mathematicians they’d contact would be equally willing to contribute directly to OpenAI.)