Tamay from Epoch AI here.
We made a mistake in not being more transparent about OpenAI’s involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset. We own this error and are committed to doing better in the future.
For future collaborations, we will strive to improve transparency wherever possible, ensuring contributors have clearer information about funding sources, data access, and usage purposes at the outset. While we did communicate to some mathematicians that we had received lab funding, we didn’t do this systematically and did not name the lab we worked with. This inconsistent communication was a mistake. We should have pushed harder for the ability to be transparent about this partnership from the start, particularly with the mathematicians creating the problems.
Getting permission to disclose OpenAI’s involvement only around the o3 launch wasn’t good enough. Our mathematicians deserved to know who might have access to their work. Even though we were contractually limited in what we could say, we should have made transparency with our contributors a non-negotiable part of our agreement with OpenAI.
Regarding training usage: We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.
Relevant OpenAI employees’ public communications have described FrontierMath as a ‘strongly held out’ evaluation set. While this public positioning aligns with our understanding, I would also emphasize more broadly that labs benefit greatly from having truly uncontaminated test sets.
OpenAI has also been fully supportive of our decision to maintain a separate, unseen holdout set—an extra safeguard to prevent overfitting and ensure accurate progress measurement. From day one, FrontierMath was conceived and presented as an evaluation tool, and we believe these arrangements reflect that purpose.
[Edit: Clarified OpenAI’s data access—they do not have access to a separate holdout set that serves as an additional safeguard for independent verification.]
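(As an illustration of why the unseen holdout matters: if a model scores much higher on the problems a lab has seen than on problems it has never seen, that gap is evidence of contamination or overfitting. Below is a minimal, hypothetical sketch of such a check; the function and attribute names are assumptions, not Epoch’s actual evaluation code.)

```python
# Minimal, hypothetical sketch of a holdout-based contamination check.
# `model.solve`, `p.statement`, and `p.answer` are assumed interfaces,
# not Epoch's or OpenAI's actual tooling.

def grade(model, problems):
    """Fraction of problems where the model's final answer matches the key."""
    correct = sum(1 for p in problems if model.solve(p.statement) == p.answer)
    return correct / len(problems)

def holdout_check(model, shared_set, holdout_set, tolerance=0.05):
    shared_score = grade(model, shared_set)    # problems the lab has access to
    holdout_score = grade(model, holdout_set)  # problems the lab has never seen
    gap = shared_score - holdout_score
    # A large positive gap suggests the shared problems leaked into training
    # or model selection; a small gap supports the reported score.
    return shared_score, holdout_score, gap > tolerance
```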
Get that agreement in writing.
I am happy to bet 1:1 OpenAI will refuse to make an agreement in writing to not use the problems/the answers for training.
You have done work that contributes to AI capabilities, and you have misled mathematicians who contributed to that work about its nature.
I’m not sure that would be particularly reassuring to me (writing as one of the contributors). First, how would one check that the agreement had been adhered to (maybe it’s possible, I don’t know)? Second, people in my experience often don’t notice they are training on data (as mentioned in a post above by ozziegooen).
I agree entirely that it would not be very reassuring, for the reasons you explained. But I would still consider it a mildly interesting signal to see if OpenAI would be willing to provide such an agreement in writing, and maybe make a public statement on the precise way they used the data so far.
Also: if they make a legally binding commitment, and then later evidence shows up that they violated the terms of this agreement (e.g. via whistleblowers), I do think that this is a bigger legal risk for them than breaching some fuzzy verbal agreement.
I found this extra information very useful, thanks for revealing what you did.
Of course, to me this makes OpenAI look quite poor. This seems like an incredibly obvious conflict of interest.
I’m surprised that the contract didn’t allow Epoch to release this information until recently, but does allow Epoch to release it afterwards. This seems really sloppy on OpenAI’s part. I guess they got a bit of extra publicity when o3 was released (even though the model wasn’t even available), but now it winds up looking worse (at least to those paying attention). I’m curious whether this discrepancy was malice or carelessness.
Hiding this information seems very similar to lying to the public. So at the very least, from what I’ve seen, I don’t feel like we have many reasons to trust their communications, especially their “tweets from various employees.”
> However, we have a verbal agreement that these materials will not be used in model training.
I imagine I can speak for a bunch of people here when I say I’m pretty skeptical. At the very least, it’s easy for me to imagine situations where the data wasn’t technically used directly in training, but was used by researchers when iterating on versions, to make sure the system was going in the right direction. This could lead to a very blurry line where they could do things that aren’t [literal LLM training] but basically achieve a similar outcome.
If by this you mean “OpenAI will not train on this data”, that doesn’t address the vast majority of the concern. If OpenAI is evaluating the model against the data, they will be able to more effectively optimize for capabilities advancement, and that’s a betrayal of the trust of the people who worked on this with the understanding that it would be used only outside the research loop to check for dangerous advancements. And, in particular, not to make those dangerous advancements come sooner by giving OpenAI another number to optimize for.
If you mean OpenAI will not be internally evaluating models on this to improve and test the training process, please state this clearly in writing (and maybe explain why they got privileged access to the data despite being prohibited from the obvious use of that data).
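(To make this concern concrete, here is a hedged sketch of how benchmark access can steer development even if no problem ever appears in a training batch: once the score is used to choose between candidate models or training configurations, the benchmark sits inside the optimization loop. All names below are hypothetical; this is not a claim about what OpenAI actually does.)

```python
# Hypothetical illustration: "we never trained on the data" can still mean
# the data shaped the model, because the score drives model selection.

def benchmark_score(model, problems):
    return sum(model.solve(p.statement) == p.answer for p in problems) / len(problems)

def develop(candidate_configs, train_fn, benchmark_problems):
    best_model, best_score = None, float("-inf")
    for config in candidate_configs:
        model = train_fn(config)                    # no benchmark data used in training
        score = benchmark_score(model, benchmark_problems)
        if score > best_score:                      # ...but the score decides what ships,
            best_model, best_score = model, score   # which is optimization against the set
    return best_model
```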
I think you should publicly commit to:
full transparency about any funding from for-profit organizations, including nonprofit organizations affiliated with for-profits
no access to the benchmarks for any company
no NDAs around this stuff
If you currently have any of these arrangements for the computer-use benchmark in development, you should seriously try to get out of those contractual obligations.
Ideally, you would commit to these in a legally binding way, which would make them non-negotiable in any negotiation and make you more credible to outsiders.
We could also ask if these situations exist (“is there any funder you have that you didn’t disclose?” and so on, especially around NDAs), and Epoch could respond with Yes/No/Can’tReply[1].
Also seems relevant for other orgs.
This would only patch the kinds of problems we can easily think about, but it seems to me like a good start.
I learned that trick from hpmor!
How much funding did OpenAI provide EpochAI?
Or, how much funding do you expect to receive in total from OpenAI for FrontierMath if you haven’t received all funding yet?
William,
not exactly an answer to your question, but a BOTEC (back-of-the-envelope calculation) estimate of FrontierMath’s costs: $400k to $2M.
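(A range that wide falls out of very simple arithmetic. The problem count and per-problem figures below are purely illustrative placeholders, not numbers from Epoch or from the commenter.)

```python
# Purely illustrative back-of-the-envelope calculation: placeholder assumptions
# showing how a $400k-$2M range can arise for a benchmark of roughly this scale.

num_problems = 300        # assumed problem count
cost_low = 1_300          # $ per problem: authoring + review (assumed)
cost_high = 6_700         # $ per problem: senior authors, several review rounds (assumed)

print(f"${num_problems * cost_low:,} to ${num_problems * cost_high:,}")
# -> $390,000 to $2,010,000
```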
Thank you for the clarification! What I would be curious about: you write
> OpenAI does have access to a large fraction of FrontierMath problems and solutions
Does this include the detailed solution write-ups (mathematical arguments, in LaTeX) or just the final answers (the numerical result of the question / a Python script verifying the correctness of the AI response)?
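(The question distinguishes two very different artifacts: a full human-written solution versus a record carrying only a machine-checkable final answer. Below is a hypothetical example of the latter; this is not FrontierMath’s actual schema, just an illustration of the distinction.)

```python
# Hypothetical example of a "final answer + verification script" record,
# i.e. the minimal artifact contrasted with a full LaTeX solution write-up.
# Not FrontierMath's actual data format.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ProblemRecord:
    statement: str                   # problem text shown to the model
    verify: Callable[[str], bool]    # checks a submitted final answer

def example_record() -> ProblemRecord:
    def verify(submission: str) -> bool:
        try:
            return int(submission.strip()) == 2048   # placeholder answer key
        except ValueError:
            return False
    return ProblemRecord(statement="(problem text)", verify=verify)
```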
Just to confirm, you will be benchmarking models other than OpenAI models using this dataset, and you aren’t contractually prevented from doing this, right?
(The original blog post cites scores of models from multiple developers, so I assume so.)
Yes.
> We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities.
Can you say exactly how large a fraction OpenAI has access to, and how large the hold-out set is?
Not Tamay, but from elliotglazer on Reddit[1] (14h ago): “Epoch’s lead mathematician here. Yes, OAI funded this and has the dataset, which allowed them to evaluate o3 in-house. We haven’t yet independently verified their 25% claim. To do so, we’re currently developing a hold-out dataset and will be able to test their model without them having any prior exposure to these problems.
My personal opinion is that OAI’s score is legit (i.e., they didn’t train on the dataset), and that they have no incentive to lie about internal benchmarking performances. However, we can’t vouch for them until our independent evaluation is complete.”
*Currently developing* a hold-out dataset gives a different impression than
“We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities” and “they do not have access to a separate holdout set that serves as an additional safeguard for independent verification.”
Emphasis mine. He also mentions “the holdout set we are developing” on Twitter.
Creating further, even harder datasets could plausibly accelerate OpenAI’s progress. I read on Twitter that people are working on an even harder dataset now. I would not give them access to this; they may break their promise not to train on it if doing so allows them to accelerate progress. This is extremely valuable training data that you have handed to them.
Suggested market. Happy to take suggestions on how to improve it:
https://manifold.markets/NathanpmYoung/will-o3-perform-as-well-on-the-fron?play=true
This is extremely informative, especially the bit about the holdout set. I think it’d reassure a lot of people about FrontierMath’s validity to know more here. Have you used it to assess any of OpenAI’s models? If so, how, and what were the results?
> For future collaborations, we will strive to improve transparency wherever possible, ensuring contributors have clearer information about funding sources, data access, and usage purposes at the outset.
Would you make a statement that would make you legally liable/accountable on this?