>Noam Brown: “Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline”
https://x.com/polynoamial/status/1946478249187377206
>”Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians.”
>”We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.” https://x.com/alexwei_/status/1946477749566390348
So there’s some new breakthrough...?
>”o1 thought for seconds. Deep Research for minutes. This one thinks for hours.” https://x.com/polynoamial/status/1946478253960466454
>”LLMs for IMO 2025: gemini-2.5-pro (31.55%), o3 high (16.67%), Grok 4 (11.90%).” https://x.com/denny_zhou/status/1945887753864114438
So public LLMs are bad at the IMO, while internal models are getting gold medals? Fascinating.
More interesting than the score is the implication that these were pass@1 results, i.e. the model produced a single final “best shot” for each question, which at the end of the 4.5 hours was handed off to human graders, rather than pass@1000 with literally thousands of automated attempts. If true, this suggests that test-time scaling is now moving away from the “spray and pray” paradigm. Feels closer to “actually doing thinking”. This is kinda scary.
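(To make the pass@1 vs. pass@1000 distinction concrete, here is the standard unbiased pass@k estimator from Chen et al. (2021) as a minimal Python sketch. This is not a claim about how OpenAI scored its run; it just shows why a handful of correct samples among thousands of automated attempts yields a high pass@1000 while pass@1 stays tiny.)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples drawn without replacement from n attempts
    (c of which are correct) is correct."""
    if n - c < k:
        return 1.0  # fewer incorrect attempts than k, so a correct one is always included
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 3 correct proofs out of 1000 automated attempts.
print(pass_at_k(1000, 3, 1000))  # 1.0   -- "spray and pray" looks great at k=1000
print(pass_at_k(1000, 3, 1))     # 0.003 -- but the single-answer setting does not
```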
Eh. Scaffolds that involve agents privately iterating on ideas and then outputting a single result are a known approach; see, e.g., this, or Deep Research, or possibly o1 pro/o3 pro. I expect it’s something along the same lines, except with some trick that makes it work better than ever before… Oh, come to think of it, Noam Brown did have that interview I was meaning to watch, about “scaling test-time compute to multi-agent civilizations”. That sounds relevant.
I mean, it can be scary, for sure; no way to be certain until we see the details.
Well, that’s mildly unpleasant.
But not that unpleasant, I guess. I really wonder what people think when they see a benchmark on which LLMs get 30%, and then confidently say that 80% is “years away”. Obviously, if LLMs already get 30%, it proves they’re fundamentally capable of solving that task[1], so the benchmark will be saturated once AI researchers do more of the same. Hell, Gemini 2.5 Pro apparently got 5/7 (71%) on one of the problems, so clearly outputting 5/7-tier answers to IMO problems was a solved problem, so an LLM getting at least 6×5 = 30 out of 42 points in short order should have been expected. How was this not priced in...?
Hmm, I think there’s a systemic EMH failure here. People appear to think that time-to-benchmark-saturation scales with the status gap between a human able to reach the current score and a human able to reach the target score, instead of estimating it using gears-level models of how AI works. You can probably get free Manifold mana by looking at supposedly challenging benchmarks, finding the ones that already have above-10% scores, and then being more bullish on them than the market.
ARC-AGI-2 seems like the obvious one.
I don’t like the sound of that, but if this is their headline result, I’m still sleeping soundly, and my update is that people are bad at thinking about benchmarks.
Unless the benchmark has difficulty tiers the way, e.g., FrontierMath does, which I think the IMO doesn’t.
Agreed, I don’t really get how this could be all that much of an update. I think the cynical explanation here is probably correct, which is that most pessimism is just vibes-based (as is most optimism).
Note that the likely known SOTA was even higher than 30%. Google never released Gemini 2.5 Pro Deep Think, which they claimed scored 49% on the USAMO (vs. 34.5% for Gemini 2.5 Pro 05-06). It’s a little hard to convert this to an implied IMO score (especially because matharena.ai oddly has the June Gemini 2.5 model with a significantly lower USAMO score (24%) yet a similar IMO score (~31.5%)), but my guess is Deep Think would get somewhere between 37% and 45% on the IMO. 81% remains a huge jump, of course.
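(One crude way to sanity-check that 37–45% guess, as a back-of-the-envelope of my own rather than the commenter’s stated reasoning: assume Deep Think’s relative uplift on the USAMO transfers proportionally to the IMO.)

```python
# Assumption (purely illustrative): Deep Think's relative USAMO uplift
# transfers proportionally to the IMO.
base_usamo, deepthink_usamo = 0.345, 0.49   # Gemini 2.5 Pro 05-06 vs. claimed Deep Think USAMO score
base_imo = 0.315                            # matharena.ai Gemini 2.5 Pro IMO score (~31.5%)
print(base_imo * deepthink_usamo / base_usamo)  # ~0.45, near the top of the 37-45% range
```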
Perhaps, perhaps not. Substantial weight was on the “no one bothers” case—no one was reporting such high scores on the USAMO (pretty similar difficulty to IMO) and the market started dropping rapidly after the USAMO date. Note that we were still at 50% odds of IMO gold a week ago—but the lack of news of anyone trying drove it down to ~26%.
Interestingly, I can find write-ups roughly predicting the order of problem difficulty for AI. Looking at Gemini 2.5 Pro’s results so far, using AlphaGeometry would have guaranteed problem 2, so assuming Pro Deep Think only boosted performance on the non-geometric problems, we’d be at ~58% using Deep Think + AlphaGeometry, giving bronze and close to silver. I think it was reasonable to assume an extra 4+ months (2 months of timeline, 2 months of labs being ahead of releases) + more compute would have given the 2 more points needed for silver.
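(A possible reconstruction of that ~58% figure, my own guess at the arithmetic rather than anything the commenter spells out: give AlphaGeometry full marks on the geometry problem and apply Deep Think’s claimed 49% USAMO rate to the remaining five problems.)

```python
# Assumed reconstruction: 7/7 on the geometry problem via AlphaGeometry, plus
# Deep Think's claimed 49% rate applied to the other five problems (35 points).
geometry_points = 7
other_points = 0.49 * 35
print((geometry_points + other_points) / 42)  # ~0.575, i.e. roughly 58%
```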
What is surprising is that a generalist LLM got better at combinatorics (problem 1) and learned to solve geometry problems well. I’m neither an AI expert nor a math-competition expert, so I can’t opine on whether this is a qualitative gain or just an example of a company targeting these specific problems (lots of training on math + lots of inference).
I think the IMO result is best-of-32 and the USAMO one is not.
Good point. This does update me downward on Deep Think outperforming matharena’s gemini-2.5-pro IMO run, as it is possible Deep Think was internally doing a similar selection process to begin with. Difficult to know without randomly sampling gemini-2.5-pro’s answers and seeing how much the best-of-n selection lifted its score.
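(A minimal sketch of that sampling experiment, under assumed inputs: the per-problem score lists are hypothetical toy data, and max() plays the role of an oracle selector, whereas a real best-of-n pipeline would select imperfectly and lift the score less.)

```python
import random

def best_of_n_uplift(attempt_scores: list[list[float]], n: int, trials: int = 10_000) -> float:
    """attempt_scores[p] holds the scores of many independent attempts at problem p.
    Returns the average total-score lift from taking the best of n random attempts
    per problem versus a single random attempt per problem."""
    random.seed(0)
    single_total, best_total = 0.0, 0.0
    for _ in range(trials):
        single_total += sum(random.choice(scores) for scores in attempt_scores)
        best_total += sum(max(random.sample(scores, n)) for scores in attempt_scores)
    return (best_total - single_total) / trials

# Toy, made-up data: one problem that usually scores 2/7 but occasionally 7/7.
toy = [[2, 2, 2, 7, 2, 2, 2, 2] * 4]
print(best_of_n_uplift(toy, n=32))  # ~4.4 points of lift from (oracle) best-of-32 on this toy problem
```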
You sold, what changed your mind?
Misunderstood the resolution terms. ARC-AGI-2 submissions that are eligible for prizes are constrained as follows:
Unlike the public leaderboard on arcprize.org, Kaggle rules restrict you from using internet APIs, and you only get ~$50 worth of compute per submission. In order to be eligible for prizes, contestants must open source and share their solution and work into the public domain at the end of the competition.
Grok 4 doesn’t count, and whatever frontier model beats it won’t count either. The relevant resolution criterion for frontier model performance on the task is “top score at the public leaderboard”. I haven’t found a market for that.
(You can see how the market in which I hastily made that bet didn’t move in response to Grok 4. That made me suspicious, so I actually read the details, and, well, kind of embarrassing.)
DeepMind supposedly also has gold, but the employee who said so deleted the tweet, so that’s not official yet.
Ugh. Just when I felt I could relax a bit after seeing Grok 4’s lackluster performance.
Still, this seems quite suspicious to me. Pretty much everybody is looking into test-time compute and RLVR right now. How come (seemingly) nobody else has found out about this “new general-purpose method” before OpenAI? There is clearly a huge incentive to cheat here, and OpenAI has been shown to not be particularly trustworthy when it comes to test and benchmark results.
Edit: Oh, this is also interesting: https://leanprover.zulipchat.com/#narrow/channel/219941-Machine-Learning-for-Theorem-Proving/topic/Blind.20Speculation.20about.20IMO.202025/near/529569966
“I don’t think OpenAI was one of the AI companies that agreed to cooperate with the IMO on testing their models and don’t think any of the 91 coordinators on the Sunshine Coast were involved in assessing their scripts.”
Well, someone has to be the first, and they got to RLVR itself first last September.
They have? How so?
Silently sponsoring FrontierMath and receiving access to the question sets, and, if I remember correctly, o3 and o3-mini performing worse on a later evaluation done on a newer private question set of some sort. Also, whatever happened with their irreproducible ARC-AGI results, and their later explicitly confirming that the model Arc Prize got access to in December was different from the released versions, with different training and a special compute tier, despite OpenAI employees having claimed that the version of o3 used in the evaluations was fully general and not tailored towards specific tasks.
Sure, but I’m just quite skeptical that it’s specifically the lab known for endless hype that does. Besides, a lot fewer people were looking into RLVR at the time o1-preview was released, so the situations aren’t exactly comparable.
IIRC, that worse performance was due to using a worse/less-adapted agentic scaffold, rather than OpenAI making the numbers up or engaging in any other egregious tampering. Regarding ARC-AGI, the December-2024 o3 and the public o3 are indeed entirely different models, but I don’t think that implies the December one was tailored for ARC-AGI.
I’m not saying OpenAI isn’t liable to exaggerate or massage their results for hype, but I don’t think they ever outright cheated (as far as we know, at least). Especially given that getting caught cheating will probably spell the AI industry’s death, and their situation doesn’t yet look so desperate that they’d risk that.
So I currently expect that the results here are, ultimately, legitimate. What I’m very skeptical about are the implications of those results that people are touting around.
Are you sure? I’m pretty sure that was cited as *one* of the possible reasons, but not confirmed anywhere. I don’t know if some minor scaffolding differences could have that much of an effect (-15%?) on a math benchmark, but if they did, that should have been accounted for in the first place. I don’t think other models were tested with scaffolds specifically engineered to get them a higher score.
As per Arc Prize and what they say OpenAI told them, the December version (“o3-preview”, as Arc Prize named it) had a compute tier above that of any publicly released model. Not only that, they say that the public version of o3 didn’t undergo any RL for ARC-AGI, “not even on the train set”. That seems suspicious to me, because once you train a model on something, you can’t easily untrain it; as per OpenAI, the ARC-AGI train set was “just a tiny fraction of the o3 train set” and, once again, the model used for the evaluations was “fully general”. This means one of three things: either o3-preview was trained on the ARC-AGI train set close to the end of the training run and OpenAI was easily able to load an earlier checkpoint to undo that, then chose not to train on it again for unknown reasons; or the public version of o3 was retrained from scratch/a very early checkpoint and, again, not trained on the ARC-AGI data for unknown reasons; or o3-preview was somehow specifically tailored towards ARC-AGI. The last option seems the most likely to me, especially considering the custom compute tier used in the December evaluation.