Tao’s reaction sounds like it might have something to do with this:
According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.
According to a Coordinator on Problem 6, the one problem OpenAI couldn’t solve, “the general sense of the IMO Jury and Coordinators is that it was rude and inappropriate” for OpenAI to do this.
OpenAI wasn’t one of the AI companies that cooperated with the IMO on testing their models, so unlike the likely upcoming Google DeepMind results, we can’t even be sure OpenAI’s “gold medal” is legit. Still, the IMO organizers directly asked OpenAI not to announce their results immediately after the olympiad.
Sadly, OpenAI desires hype and clout a lot more than it cares about letting these incredibly smart kids celebrate their achievement, and so they announced the results yesterday.
Terence Tao has responded on mathstodon:
https://mathstodon.xyz/@tao/114881418225852441
He seems to have moved from “AI is decent but not that special” to “you can’t compare AI to humans”, i.e. from appreciation to denial 2 in the AI cycle, which tends to go:
Intrigue → Denial → Appreciation → Denial 2 → Fear
He’s saying that you can compare AI to humans, but to do that to a professional research standard, you need to meet the following criteria:
Be explicit about your methodology in advance
Be specific about the comparison you are making
Justify your interpretation of your results
OpenAI is currently not attempting to meet this standard. They are hyping their product. That is their prerogative, but Tao is under no obligation to participate in the hype, and he’s being clear about the conditions under which he would comment.
I am sure he is not in denial. He knows that the AI systems are on the trajectory to the top and beyond.
But, on one hand, he is saying that proper methodology is important and expects it to be in place for next year’s competition: https://mathstodon.xyz/@tao/114877789298562646.
On the other hand, it’s great progress, but let’s not be hypnotized by the word “gold”. The model made it to the bottom border of the “gold medal” tier, the largest yellow bar on the histogram here https://www.imo-official.org/year_individual_r.aspx?year=2025.
It’s the top 11% of participants, so it’s great progress, but not some “exclusive and exceptional win”.
Also the shape of that histogram strongly suggests that the IMO scoring process is weird and probably adversarial (given that team leads advocate for their participants during the scoring). The fact that we see this huge peak on the histogram at 35 and that we also see local maxima at the bottoms of the other two medal tiers is suggestive of a process which is not plain “impartial and blind grading” (perhaps that part of the IMO methodology could also use some improvement).
If it resembles the International Chemistry Olympiad (which, like most I[X]Os, is based on the IMO), then yeah, it’s weird and adversarial. But the threshold for gold here is exactly 5 of 6 questions fully correct, which is also a natural breakpoint. This happens because generally you either have a proof or you don’t, and getting n−1 out of n points usually means something like missing a single case in a proof by exhaustion, which is much less common than simply failing to produce a proof. Most of the people who got 35/42 did so with scores of 7, 7, 7, 7, 7, 0. So there’s that factor as well.
Ah, yes, you are right. And the silver medal threshold is 28=4*7. So this is much more natural, and mostly comes from how the competition is structured (the scoring factor still looks somewhat noticeable to my eye, but is much less of a problem than I thought).
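The breakpoint arithmetic from the last two comments can be made explicit. A minimal sketch (the score profile is the illustrative one quoted above, not actual contestant data):

```python
POINTS_PER_PROBLEM = 7  # each of the 6 IMO problems is graded out of 7

# The medal cutoffs discussed above fall on whole-problem boundaries
gold_cutoff = 5 * POINTS_PER_PROBLEM    # 35: five complete solutions
silver_cutoff = 4 * POINTS_PER_PROBLEM  # 28: four complete solutions

# Typical 35-point profile per the comment above: five full proofs, one zero
typical_profile = [7, 7, 7, 7, 7, 0]

print(gold_cutoff, silver_cutoff, sum(typical_profile))
```

This is why the histogram peaks at multiples of 7 need no adversarial-grading explanation: partial credit is rare, so totals cluster at whole-problem sums.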
But most of his specific methodological issues are inapplicable here, unless OpenAI is lying: they didn’t rewrite the questions, provide tools, intervene during the run, or hand-select answers.
I don’t have a theory of Tao’s motivations, but if the post I linked is interpreted as a response to OpenAI’s result (he didn’t say it was, but he didn’t say it wasn’t, and the timing makes it an obvious interpretation), raising those issues is bizarre.
First of all, we would like to see pre-registration, so that we don’t end up learning only about successes (and generally cherry-picking good results, while omitting negative results).
He is trying to steer the field towards generally better practices. I don’t think this is specifically a criticism of this particular OpenAI result, but more an attempt to change the standards.
Although he is likely to have some degree of solidarity with the IMO viewpoint and to share some of their annoyance with the timing of all this, e.g. https://www.reddit.com/r/math/comments/1m3uqi0/comment/n40qbe9/
Ok, denial is too strong a word. I don’t exactly know how to describe the mental motion he’s doing though.
By volume, his post thread is mostly discussions of ways in which this isn’t a fair comparison, whereas the correct epistemic update is more like “OK so competition maths is solved, what does this mean next?”. It’s a level of garymarcusing where he doesn’t disagree with any facts on the ground but the overall vibe of the piece totally misses the wood for the trees in a particular and consistent direction. Terry’s opinions on maths AI (which one would hope to be a useful data point) are being relegated to a lagging indicator by this mental motion.
I would not say it is solved :-)
I am sure we’ll see an easy and consistent 42 score from the models sooner rather than later, and we’ll see much more than that in the adjacent areas, but not yet :-)
(Someone who got a bronze in the late 1960s tells me that this idea of giving gold medals to 10+% of the participants is relatively recent, and that when they were competing back in the ’60s there would have been exactly 5 gold medals with this table of results.)
My recollection from the late 1980s when I was doing IMOs is that the proportions were supposed to be something like 6:3:2:1 nothing:bronze:silver:gold, so about 8% gold medals. I don’t think I ever actually verified this by talking to senior officials or checking the numbers.
(As for Terry Tao, I agree with you that he is clearly not in denial, he’s just cross at OpenAI for preferring PR over (1) good science and (2) politeness.)
Yeah, I actually looked at the early years today, and in 1969 only the three perfect scores won gold, and in 1970 this was relaxed a little bit, and the overall trend looked to me like there were multiple reforms with gradual relaxation of the standards for gold (although I did not do more than superficial sampling from several time points).
I think the official goal is still approximately 6:3:2:1, but this year those fuzzy boundaries resulted in 67 gold medals out of 630 participants (slightly above 10.6%).
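A quick check of the shares implied by the numbers in this thread (the 6:3:2:1 split and the 67-of-630 figure are taken from the comments above):

```python
# Nominal split nothing : bronze : silver : gold = 6 : 3 : 2 : 1
nominal_gold_share = 1 / (6 + 3 + 2 + 1)   # 1/12, about 8.3%

# This year's quoted figures: 67 golds among 630 participants
actual_gold_share = 67 / 630               # about 10.6%

print(f"nominal {nominal_gold_share:.1%} vs actual {actual_gold_share:.1%}")
```

So this year’s gold share ran a couple of percentage points above the nominal 1-in-12 target, consistent with the “fuzzy boundaries” point above.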
Is it just me, or is he making these two wrong assumptions?
1. He thinks this was a system (of models) like AlphaProof
2. That this model had internet access
Surprising that no one on mathstodon has mentioned this. I wonder what he would say if he knew it was a single LLM without internet access.
Honestly, that thread did initially sound kind of copium-y to me too, which I was surprised by, since his AI takes are usually pretty good[1] and level-headed. But it makes much more sense under the interpretation that this isn’t him being in denial about AI performance, but him undermining OpenAI in response to them defecting against IMO. That’s why he’s pushing the “this isn’t a fair human-AI comparison” line.
[1] For someone who doesn’t “feel the ASI”, I mean.
I would not characterize Tao’s usual takes on AI as particularly good (unless you compare with a relatively low baseline).
He’s been overall pretty conservative and mostly stuck to reasonable claims about current AI. So there’s not much to criticize in particular, but it has come at the cost of him not appreciating the possible/likely trajectories of where things are going, which I think misses the forest for the trees.
Oh, yeah, he’s not superintelligence-pilled or anything. I was implicitly comparing with a relatively low baseline, yes.
Your “Intrigue → Denial → Appreciation → Denial 2 → Fear” cycle really hits the spot. I will filter future comments from AI pundits through this lens.