Note that the likely known SOTA was even higher than 30%. Google never released Gemini 2.5 Pro Deep Think, which they claimed scored 49% on the USAMO (vs. 34.5% for Gemini 2.5 Pro 05-06). It's a little hard to convert this to an implied IMO score, especially because matharena.ai oddly has the June Gemini 2.5 model with a significantly lower USAMO score (24%) but a similar IMO score (~31.5%). My guess is Deep Think would get somewhere between 37% and 45% on the IMO. 81% remains a huge jump, of course.
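One crude way to get an implied IMO number from those USAMO figures (my own reconstruction, not something Google stated) is to scale the released model's IMO score by Deep Think's USAMO improvement ratio:

```python
# Crude USAMO -> implied IMO conversion. Assumption (mine): the relative
# improvement Deep Think shows on the USAMO carries over to the IMO.
usamo_deep_think = 0.49   # claimed Gemini 2.5 Pro Deep Think USAMO score
usamo_pro = 0.345         # Gemini 2.5 Pro (05-06) USAMO score
imo_pro = 0.315           # matharena.ai Gemini 2.5 Pro IMO score (~31.5%)

implied_imo = imo_pro * (usamo_deep_think / usamo_pro)
print(f"implied Deep Think IMO score: {implied_imo:.0%}")   # ~45%
```

That lands at the top of the 37–45% guess; scaling off the June model's 24% USAMO figure instead would imply well over 60%, which is part of why the conversion is messy.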
Hmm, I think there’s a systemic EMH failure here.
Perhaps, perhaps not. Substantial weight was on the “no one bothers” case: no one was reporting such high scores on the USAMO (pretty similar difficulty to the IMO), and the market started dropping rapidly after the USAMO date. Note that we were still at 50% odds of IMO gold a week ago, but the lack of news of anyone trying drove it down to ~26%.
Interestingly, I can find write-ups roughly predicting the order of AI difficulty across the problems. Looking at Gemini 2.5 Pro's results so far, using AlphaGeometry would have guaranteed problem 2, so assuming Pro Deep Think only boosted performance on the non-geometry problems, we'd be at ~58% using Deep Think + AlphaGeometry, good for a Bronze and close to a Silver. I think it was reasonable to assume an extra 4+ months (2 months of timeline, plus labs being ~2 months ahead of what they release) and more compute would have given the 2 more points to get Silver.
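As a rough check on that ~58% figure, here is a minimal back-of-the-envelope sketch. The decomposition is my assumption: I take the midpoint of the 37–45% Deep Think guess as coming entirely from the non-geometry problems, and credit AlphaGeometry with a full 7/7 on problem 2.

```python
# Back-of-the-envelope for the "~58% with Deep Think + AlphaGeometry" estimate.
# Assumptions (mine, not stated above): Deep Think's guessed 37-45% IMO score
# comes entirely from the non-geometry problems, and AlphaGeometry adds a
# full 7/7 on problem 2.
TOTAL_POINTS = 6 * 7                                     # 6 problems, 7 points each = 42

deep_think_guess = (0.37 + 0.45) / 2                     # midpoint of the guessed range
non_geometry_points = deep_think_guess * TOTAL_POINTS    # ~17.2 points
alphageometry_points = 7                                 # problem 2 solved outright

combined = non_geometry_points + alphageometry_points
print(f"combined score: {combined:.1f}/42 = {combined / TOTAL_POINTS:.0%}")
# combined score: 24.2/42 = 58%
```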
What is surprising is that a generalist LLM got better at combinatorics (problem 1) and learned to solve geometry problems well. I'm neither an AI expert nor a math competition expert, so I can't opine on whether this is a qualitative gain or just an example of a company targeting these specific problems (lots of training on math + lots of inference compute).
I think the IMO result is best-of-32 and the USAMO result is not.
Good point. This does update me downward on Deep Think outperforming matharena's gemini-2.5-pro IMO run, as it is possible Deep Think was internally doing a similar selection process to begin with. It's difficult to know without randomly sampling gemini-2.5-pro's answers and seeing how much the best-of-n selection lifted its score.
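For intuition on how much a best-of-n step can lift a per-problem solve rate, here is a toy illustration. It assumes independent samples and a selector that always picks a correct answer when one exists; both are my simplifying assumptions, and a real selection process (majority vote, LLM judging, etc.) would lift scores less than this.

```python
# Toy model of best-of-n lift: if a single sample solves a problem with
# probability p, and the n samples are independent with a perfect selector,
# the chance that best-of-n produces a solution is 1 - (1 - p)**n.
def best_of_n(p: float, n: int) -> float:
    """Probability that at least one of n independent samples solves the problem."""
    return 1 - (1 - p) ** n

for p in (0.05, 0.10, 0.25):
    print(f"single-sample p = {p:.2f}:  best-of-32 = {best_of_n(p, 32):.2f}")
# single-sample p = 0.05:  best-of-32 = 0.81
# single-sample p = 0.10:  best-of-32 = 0.97
# single-sample p = 0.25:  best-of-32 = 1.00
```

Even this overstated version shows why a best-of-32 IMO run and a single-sample USAMO run aren't directly comparable.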