isabel comments on FrontierMath Score of o3-mini Much Lower Than Claimed

isabel 17 Mar 2025 23:23 UTC
4 points
0
I think your Epoch link re-links to the OpenAI result, not something by Epoch.
How likely is it that OpenAI was simply willing to throw absurd amounts of inference-time compute at the problem set to get a good score?
- YafahEdelman 17 Mar 2025 23:42 UTC
  3 points
  4
  Fixed the link.
  
  IMO that’s plausible, but it would be pretty misleading, since they described it as “o3-mini with high reasoning” and had “o3-mini (high)” in the chart, and o3-mini high is what they call a specific option in ChatGPT.
  - isabel 18 Mar 2025 4:04 UTC
    7 points
    2
    The reason my first thought was that they used more inference is that ARC Prize specifies that that’s how they got their ARC-AGI score (https://arcprize.org/blog/oai-o3-pub-breakthrough). My read on this graph is that they spent $300k+ on getting their score (there are 100 questions in the semi-private eval, so roughly $3k+ per question). That was o3 high, not o3-mini high, but this result is a pretty strong proof of concept that they’re willing to spend a lot on inference for good scores.