james oofou comments on Some lessons from the OpenAI-FrontierMath debacle

james oofou 19 Jan 2025 23:24 UTC
4 points
0
It might as well be possible that o3 solved problems only from the first tier, which is nowhere near as groundbreaking as solving the harder problems from the benchmark
This doesn’t appear to be the case:
https://x.com/elliotglazer/status/1871812179399479511
of the problems we’ve seen models solve, about 40% have been Tier 1, 50% Tier 2, and 10% Tier 3
- 7vik 19 Jan 2025 23:39 UTC
  5 points
  0
  I don’t think this info was about o3 (please correct me if I’m wrong). While it suggests that not all of the solved problems were from the first tier, it would be much better to know the actual breakdown. In particular, since the most famous quotes about FrontierMath (“extremely challenging” and “resist AIs for several years at least”) were about the top 25% hardest problems, accuracy on that set seems like the more important number to update on. (That is not to say that 25% is a small feat in any case.)
  - james oofou 20 Jan 2025 9:40 UTC
    2 points
    0
    Although it’s not made explicit, we can deduce that it’s at least in part about o3 from this earlier tweet from the same person:
    https://x.com/ElliotGlazer/status/1870613418644025442
    3/9 Although o3 solved problems in all three tiers, it likely still struggles on the most formidable Tier 3 tasks—those “exceptionally hard” challenges that Tao and Gowers say can stump even top mathematicians.