Terence Tao (who should be in a position to know) was involved in evaluating the o1 model (which is by now somewhat dated). In the context of acting as a research assistant, he described it as equivalent to advising a “mediocre, but not completely incompetent” graduate student. That’s not “better than basically all human mathematicians”, but it’s also not so very far off, if it’s about as good as the grade of graduate students that Terence Tao has access to as research assistants.
To add a bit of nuance/context, here’s what Tao said:
In https://chatgpt.com/share/94152e76-7511-4943-9d99-1118267f4b2b I gave the new model a challenging complex analysis problem (which I had previously asked GPT4 to assist in writing up a proof of in https://chatgpt.com/share/63c5774a-d58a-47c2-9149-362b05e268b4 ). Here the results were better than previous models, but still slightly disappointing: the new model could work its way to a correct (and well-written) solution *if* provided a lot of hints and prodding, but did not generate the key conceptual ideas on its own, and did make some non-trivial mistakes.
The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, (static simulation of a) graduate student. However, this was an improvement over previous models, whose capability was closer to an actually incompetent (static simulation of a) graduate student.
It may only take one or two further iterations of improved capability (and integration with other tools, such as computer algebra packages and proof assistants) until the level of “(static simulation of a) competent graduate student” is reached, at which point I could see this tool being of significant use in research level tasks. (2/3)
More on the “static simulation” part:
I am belatedly realizing that in my attempts to describe my evaluation of the capability of an AI tool, I inadvertently gave the incorrect (and potentially harmful) impression that human graduate students could be reductively classified according to a static, one dimensional level of “competence”. This was not my intent at all; and I would therefore like to make the following clarifying remarks.
Firstly, the ability to contribute to an existing research project is only one aspect of graduate study, and a relatively minor one at that. A student who is not especially effective in this regard, but excels in other dimensions such as creativity, independence, curiosity, exposition, intuition, professionalism, work ethic, organization, or social skills can in fact end up being a far more successful and impactful mathematician than one who is proficient at assigned technical tasks but has weaknesses in other areas.
Secondly, and perhaps more importantly, human students learn and grow during their studies, and areas in which they initially struggle with can become ones in which they are quite proficient at after a few years; and personally I find being able to assist students in such transitions to be one of the most rewarding aspects of my profession. In contrast, while modern AI tools have some ability to incorporate feedback into their responses, each individual model does not truly have the capability for long term growth, and so can be sensibly evaluated using static metrics of performance. However, I believe such a fixed mindset is not an appropriate framework for judging human students, and I apologize for conveying such an impression.
These additional remarks by Tao, on long-term growth and on the skills beyond problem-solving that matter for mathematical excellence, are what I think of when I consider the hypothetical in which math AIs are maxing out FrontierMath Tier 4 and yet are still nowhere near revolutionising pure math, a scenario I find increasingly plausible, cf. all the posts sharing this one’s vibe. (Writing this publicly so I can revisit it in case I’m wrong, which would be great; unlike, say, Gowers, I do want agentic artificial super-mathematicians of all kinds.)
I think the author meant that they achieve higher scores on the FrontierMath benchmark.
Do they? I thought they only do well on the easier sections.