AIs have been demonstrating what arguably constitutes superhuman performance on FrontierMath, a set of extremely difficult mathematical problems.
AIs aren’t superhuman on FrontierMath. I’d guess that Terry Tao with 8 hours per problem (and internet access) is much better than current AIs. (Especially after practicing on some of the problems etc.)
At a more basic level, this superhumanness would substantially be achieved by broadness/generality rather than by being superhuman within some field (which is arguably less important/impactful). Like, if you compared AIs to a group of humans who are pretty good at this type of math, the humans would probably also destroy the AI.
Yeah, I was probably too glib here. I was extrapolating from the results of the competition Epoch organized at MIT, where “o4-mini-medium outperformed the average human team, but worse than the combined score across all teams, where we look at the fraction of problems solved by at least one team”. This was AI vs. teams of people (rather than any one individual person), and it was only o4-mini, but none of those people were Terence Tao, and it only outperformed the average team.
I would be fascinated to see how well he’d actually perform in the scenario you describe, but presumably we’re not going to find out.
if you compared AIs to a group of humans who are pretty good at this type of math, the humans would probably also destroy the AI.
I wonder? To my understanding, each FrontierMath problem is deep in a different subfield of mathematics, so even a group of strong mathematicians might not cover them all. But I have no understanding of the craft of advanced / research mathematics, so I have no intuition here.
Anyway, I think we may be agreeing on the main point here: my suggestion that LLMs solve FrontierMath problems “the wrong way”, and your point about depth arguably being more important than breadth, seem to be pointing in the same direction.
Yep, though it’s worth distinguishing between LLMs often solving FrontierMath problems the “wrong way” and always solving them the “wrong way”. My understanding is that they don’t always solve them the “wrong way” (at least for Tier 1/2 problems rather than Tier 3 problems), so you should (probably) be strictly more impressed than you would be if you only knew that LLMs solved X% of problems the “right way”.
Good point.