I am sure he is not in denial. He knows that AI systems are on a trajectory to the top and beyond.
But, on one hand, he is saying that proper methodology is important and expects it to be in place for next year's competition: https://mathstodon.xyz/@tao/114877789298562646.
On the other hand, it’s great progress, but let’s not be hypnotized by the word “gold”. The model made it to the bottom border of the “gold medal” tier, the largest yellow bar on the histogram here https://www.imo-official.org/year_individual_r.aspx?year=2025.
That's the top 11% of participants, so it's great progress, but not some “exclusive and exceptional win”.
Also, the shape of that histogram strongly suggests that the IMO scoring process is weird and probably adversarial (team leads advocate for their participants during the grading). The huge peak on the histogram at 35, together with the local maxima at the bottoms of the other two medal tiers, suggests a process that is not plain “impartial and blind grading” (perhaps that part of the IMO methodology could also use some improvement).
If it resembles the International Chemistry Olympiad (which, like most I[X]Os, is based on the IMO), then yeah, it's weird and adversarial. But the threshold for gold here is exactly 5 out of 6 questions fully correct, which is also a natural breakpoint. This happens because generally you either have a proof or you don't, and getting n−1 out of n points usually means something like missing a single case in a proof by exhaustion, which is much less common than simply failing to produce a proof. Most of the people who got 35/42 did so with scores of 7, 7, 7, 7, 7, 0. So there's that factor as well.
Ah, yes, you are right. And the silver medal threshold is 28 = 4 × 7. So this is much more natural and mostly comes from how the competition is structured (the scoring factor still looks somewhat noticeable to my eye, but it is much less of a problem than I thought).
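For concreteness, here is a minimal sketch (in Python, with the 2025 cutoffs hard-coded as stated above rather than recomputed from official data) of why the medal thresholds land on multiples of 7: each problem is graded out of 7, and this year's gold and silver cutoffs coincide with 5 and 4 fully solved problems respectively.

```python
# Each IMO problem is marked out of 7; there are six problems, so the max is 42.
POINTS_PER_PROBLEM = 7
NUM_PROBLEMS = 6
MAX_SCORE = POINTS_PER_PROBLEM * NUM_PROBLEMS  # 42

# 2025 medal cutoffs as quoted in the discussion above (an assumption, not official data).
GOLD_CUTOFF = 35    # = 5 * 7, i.e. five problems fully solved
SILVER_CUTOFF = 28  # = 4 * 7, i.e. four problems fully solved

# A typical score line for someone sitting exactly on the gold cutoff:
# five complete proofs and one problem with no progress.
typical_gold_border = [7, 7, 7, 7, 7, 0]
assert sum(typical_gold_border) == GOLD_CUTOFF
assert GOLD_CUTOFF == 5 * POINTS_PER_PROBLEM
assert SILVER_CUTOFF == 4 * POINTS_PER_PROBLEM
```

This is just the “you either have a proof or you don't” point made explicit: the natural clustering at multiples of 7 produces the histogram peaks at the cutoffs without needing an adversarial grading process.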
But most of his specific methodological issues are inapplicable here, unless OpenAI is lying: they didn’t rewrite the questions, provide tools, intervene during the run, or hand-select answers.
I don't have a theory of Tao's motivations, but if the post I linked is interpreted as a response to OpenAI's result (he didn't say it was, but he didn't say it wasn't, and the timing makes it an obvious interpretation), raising those issues is bizarre.
First of all, we would like to see pre-registration, so that we don’t end up learning only about successes (and generally cherry-picking good results, while omitting negative results).
He is trying to steer the field towards generally better practices. I don’t think this is specifically a criticism of this particular OpenAI result, but more an attempt to change the standards.
Although he is likely to have some degree of solidarity with the IMO viewpoint and to share some of their annoyance with the timing of all this, e.g. https://www.reddit.com/r/math/comments/1m3uqi0/comment/n40qbe9/
Ok, denial is too strong a word. I don’t exactly know how to describe the mental motion he’s doing though.
By volume, his post thread is mostly discussions of ways in which this isn’t a fair comparison, whereas the correct epistemic update is more like “OK so competition maths is solved, what does this mean next?”. It’s a level of garymarcusing where he doesn’t disagree with any facts on the ground but the overall vibe of the piece totally misses the wood for the trees in a particular and consistent direction. Terry’s opinions on maths AI (which one would hope to be a useful data point) are being relegated to a lagging indicator by this mental motion.
I would not say it is solved :-)
I am sure we’ll see an easy and consistent 42 score from the models sooner rather than later, and we’ll see much more than that in the adjacent areas, but not yet :-)
(Someone who got a bronze in the late 1960s tells me that this idea of giving gold medals to 10+% of the participants is relatively recent, and that when they were competing back in the '60s there would have been exactly 5 gold medals with this table of results.)
My recollection from the late 1980s when I was doing IMOs is that the proportions were supposed to be something like 6:3:2:1 nothing:bronze:silver:gold, so about 8% gold medals. I don’t think I ever actually verified this by talking to senior officials or checking the numbers.
(As for Terry Tao, I agree with you that he is clearly not in denial, he’s just cross at OpenAI for preferring PR over (1) good science and (2) politeness.)
Yeah, I actually looked at the early years today: in 1969 only the three perfect scores won gold, in 1970 this was relaxed a little, and the overall trend looked to me like multiple reforms with a gradual relaxation of the standards for gold (although I did not do more than superficial sampling from several time points).
I think the official goal is still approximately 6:3:2:1, but this year those fuzzy boundaries resulted in 67 gold medals out of 630 participants (slightly above 10.6%).
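As a quick sanity check on those percentages (a back-of-the-envelope calculation under the assumed 6:3:2:1 split, not official IMO policy): a strict 6:3:2:1 ratio would put gold at 1/12 of the field, while this year's 67 golds out of 630 participants come out a bit higher.

```python
# 6:3:2:1 split of nothing : bronze : silver : gold, as recalled above
ratio = {"nothing": 6, "bronze": 3, "silver": 2, "gold": 1}
gold_share = ratio["gold"] / sum(ratio.values())      # 1/12
print(f"gold share under 6:3:2:1: {gold_share:.1%}")  # ~8.3%

# 2025 figures quoted above
print(f"gold share in 2025: {67 / 630:.1%}")          # ~10.6%
```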