Well, that’s mildly unpleasant.
But not that unpleasant, I guess. I really wonder what people think when they see a benchmark on which LLMs get 30% and then confidently say that 80% is “years away”. Obviously, if LLMs already get 30%, that proves they’re fundamentally capable of solving the task[1], so the benchmark will be saturated once AI researchers do more of the same. Hell, Gemini 2.5 Pro apparently got 5/7 (71%) on one of the problems, so clearly outputting 5/7-tier answers to IMO problems was a solved problem, and an LLM getting at least 6*5 = 30 out of 42 in short order should have been expected. How was this not priced in...?
Hmm, I think there’s a systemic EMH failure here. People appear to think that the time-to-benchmark-saturation scales with the difference between the status of a human able to reach the current score and the status of a human able to reach the target score, instead of estimating it using gears-level models of how AI works. You can probably get free Manifold mana by looking at supposedly challenging benchmarks, checking which ones already have above-10% scores, then being more bullish on them than the market.
ARC-AGI-2 seems like the obvious one.
“We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.”
I don’t like the sound of that, but if this is their headline result, I’m still sleeping, and my update is that people are bad at thinking about benchmarks.
[1] Unless the benchmark has difficulty tiers the way e.g. FrontierMath does, which I think the IMO doesn’t.
Agreed, I don’t really get how this could be all that much of an update. I think the cynical explanation here is probably correct, which is that most pessimism is just vibes-based (as is most optimism).
Note that the likely known SOTA was even higher than 30%. Google never released Gemini 2.5 Pro Deep Think, which they claimed scored 49% on the USAMO (vs. 34.5% for Gemini 2.5-05-06). It’s a little hard to convert this to an implied IMO score (especially because matharena.ai oddly shows a significantly lower USAMO score for the June Gemini 2.5 model (24%), though a similar IMO score (~31.5%)), but my guess is Deep Think would get somewhere between 37% and 45% on the IMO. 81% remains a huge jump, of course.
“Hmm, I think there’s a systemic EMH failure here.”
Perhaps, perhaps not. Substantial weight was on the “no one bothers” case: no one was reporting such high scores on the USAMO (pretty similar in difficulty to the IMO), and the market started dropping rapidly after the USAMO date. Note that we were still at 50% odds of IMO gold a week ago, but the lack of news of anyone trying drove it down to ~26%.
Interestingly, I can find write-ups roughly predicting the order of difficulty for AI. Looking at Gemini 2.5 Pro’s result so far, using AlphaGeometry would have guaranteed problem 2, so assuming Deep Think only boosted performance on the non-geometric problems, we’d be at 58% using Deep Think + AlphaGeometry, giving Bronze and close to Silver. I think it was reasonable to assume that an extra 4+ months (2 months of timeline, 2 months of labs being ahead of release) plus more compute would have given the 2 more points needed to get Silver.
What is surprising is that a generalist LLM got better at combinatorics (problem 1) and learned to solve geometry problems well. I’m neither an AI nor a math competition expert, so I can’t opine on whether this is a qualitative gain or just an example of a company targeting these specific problems (lots of training on math + lots of inference).
I think the IMO result is best-of-32 and the USAMO one is not.
Good point. This does update me downward on Deep Think outperforming matharena’s Gemini 2.5 Pro IMO run, as it is possible Deep Think was internally doing a similar selection process to begin with. It’s difficult to know without randomly sampling Gemini 2.5 Pro’s answers and seeing how much the best-of-n selection lifted its score.
You sold, what changed your mind?
I misunderstood the resolution terms. ARC-AGI-2 submissions that are eligible for prizes are constrained as follows:
“Unlike the public leaderboard on arcprize.org, Kaggle rules restrict you from using internet APIs, and you only get ~$50 worth of compute per submission. In order to be eligible for prizes, contestants must open source and share their solution and work into the public domain at the end of the competition.”
Grok 4 doesn’t count, and whatever frontier model beats it won’t count either. The relevant resolution criterion for frontier model performance on the task is “top score at the public leaderboard”. I haven’t found a market for that.
(You can see how the market in which I hastily made that bet didn’t move in response to Grok 4. That made me suspicious, so I actually read the details, and, well, kind of embarrassing.)
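For anyone who wants to check the back-of-the-envelope numbers from the top comment (5/7 on one problem, extrapolated to 6*5 = 30 out of 42), here is a minimal Python sketch. The function names are made up for illustration, and the only things assumed are that the IMO has six problems graded out of 7 points each. The flat per-problem extrapolation is just the comment's heuristic, not a validated forecasting method.

```python
# Back-of-the-envelope IMO score arithmetic from the thread.
# Assumption: six problems, each graded out of 7 points (42 total).
MAX_PER_PROBLEM = 7
NUM_PROBLEMS = 6
MAX_TOTAL = MAX_PER_PROBLEM * NUM_PROBLEMS  # 42


def extrapolated_total(per_problem_score: int) -> int:
    """Naive extrapolation: assume the same score on every problem."""
    return per_problem_score * NUM_PROBLEMS


def as_percent(points: float, max_points: int = MAX_TOTAL) -> float:
    """Convert a point total into a percentage of the maximum."""
    return 100.0 * points / max_points


def percent_to_points(percent: float, max_points: int = MAX_TOTAL) -> float:
    """Convert a percentage score back into points."""
    return percent / 100.0 * max_points


if __name__ == "__main__":
    # 5/7 on a single problem, as a percentage of that problem's maximum.
    print(f"5/7 on one problem: {as_percent(5, MAX_PER_PROBLEM):.1f}%")
    # The comment's extrapolation: 5 points on each of 6 problems.
    total = extrapolated_total(5)
    print(f"Extrapolated total: {total}/{MAX_TOTAL} ({as_percent(total):.1f}%)")
    # The 58% estimate mentioned in the thread, expressed in points.
    print(f"58% of {MAX_TOTAL}: {percent_to_points(58):.2f} points")
```

The same helpers can convert any of the other percentages quoted in the thread into points out of 42, which makes it easier to compare them against whatever medal cutoffs apply in a given year.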