Yeah, I was probably too glib here. I was extrapolating from the results of the competition Epoch organized at MIT, where “o4-mini-medium outperformed the average human team, but worse than the combined score across all teams, where we look at the fraction of problems solved by at least one team”. This was AI vs. teams of people (rather than any one individual), and it was only o4-mini, but none of those people were Terence Tao, and it only outperformed the average team.
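To make the scoring comparison concrete, here is a small sketch of the “average team” score vs. the “combined” score (the fraction of problems solved by at least one team). The numbers and per-team solve sets below are made up purely for illustration; they are not Epoch’s actual competition data.

```python
# Hypothetical illustration of "average team" vs. "combined" (union) scoring.
# The solve sets below are invented; they are not Epoch's competition data.
problems = set(range(10))      # pretend there are 10 benchmark problems
team_solves = [
    {0, 1, 2},                 # problems solved by team A
    {2, 3, 4},                 # team B (overlaps with A on problem 2)
    {5, 6},                    # team C
]

# Average team score: mean fraction of problems solved per team.
average_score = sum(len(s) for s in team_solves) / (len(team_solves) * len(problems))

# Combined score: fraction of problems solved by at least one team (the union).
combined_score = len(set().union(*team_solves)) / len(problems)

print(average_score)   # 0.266...
print(combined_score)  # 0.7
```

Since the union covers every problem that any team solved, the combined score is always at least as high as any single team’s score (and hence the average), so “better than the average team but worse than the combined score” leaves a fairly wide band.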
I would be fascinated to see how well he’d actually perform in the scenario you describe, but presumably we’re not going to find out.
“if you compared AIs to a group of humans who are pretty good at this type of math, the humans would probably also destroy the AI.”
I wonder? Given that, to my understanding, each FrontierMath problem is deep in a different subfield of mathematics, I’m not sure that holds. But I have no understanding of the craft of advanced / research mathematics, so I have no intuition here.
Anyway, I think we may be agreeing on the main point here: my suggestion that LLMs solve FrontierMath problems “the wrong way”, and your point about depth arguably being more important than breadth, seem to be pointing in the same direction.
Yep, though it’s worth distinguishing between LLMs often solving FrontierMath problems the “wrong way” and always solving them the “wrong way”. My understanding is that they don’t always solve them the “wrong way” (at least for Tier 1 and Tier 2 problems, rather than Tier 3 problems), so you should (probably) be strictly more impressed than you would be if you only knew that LLMs solved X% of problems the “right way”.
Good point.