I’m confused about the following: o3-mini-2025-01-31-high scores 11% on FrontierMath-2025-02-28-Private (290 questions), but 40% on FrontierMath-2025-02-28-Public (10 questions). The latter score is higher than OpenAI’s reported 32% on FrontierMath-2024-11-26 (180 questions), which is surprising considering that OpenAI probably has better elicitation strategies and is willing to throw more compute at the task. Is this because:
a) the public dataset is only 10 questions, so small-sample noise could easily account for the gap (see the quick sketch below), or
b) the 2024-11-26 dataset is somehow significantly harder?
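To put (a) in perspective, here's a quick back-of-the-envelope check. This is only a sketch: it uses scipy and assumes the 10 public questions are a random draw from the same difficulty distribution as the private set, which may well not hold.

```python
from scipy.stats import beta, binom

# 95% Clopper-Pearson (exact) interval for the public-set score: 4/10 correct.
k, n, alpha = 4, 10, 0.05
lower = beta.ppf(alpha / 2, k, n - k + 1)
upper = beta.ppf(1 - alpha / 2, k + 1, n - k)
print(f"4/10 correct -> 95% CI for the true rate: [{lower:.0%}, {upper:.0%}]")  # roughly [12%, 74%]

# If the model's true success rate matched the private-set 11%,
# how often would a random 10-question sample come out at 4 or more correct?
p_tail = binom.sf(3, 10, 0.11)  # P(X >= 4) under Binomial(10, 0.11)
print(f"P(score >= 40% on 10 questions | true rate = 11%) ~ {p_tail:.1%}")  # roughly 2%
```

(Of course this treats every question as equally hard, which FrontierMath explicitly isn't, so take it as rough intuition only.)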