LLM Gold on the IMO was predictable using METR HCAST extrapolation:
o3's 80% success time-horizon was 20 minutes.
o3 came out ~3 months ago. Add 6 months for the lab-to-public delay: 9 months of progress.
This is ~3 doublings in the current RLVR scaling paradigm, with a buff for being mathematics-specific (more verifiable) rather than ML-specific (~4-month doubling time → 3-month doubling time).
3 doublings of 20 minutes get us to 160 minutes (20 → 40 → 80 → 160).
IMO participants get an average of 90 minutes per problem.
The gold medal cutoff at IMO 2025 was 35 out of 42 points (~83%).
So, by trusting HCAST extrapolation, we could have predicted that a pure LLM system getting gold was not unlikely.
Edit: some unstated premises of this analysis:
80% doubling times are similar to 50% doubling times (see https://arxiv.org/pdf/2503.14499)
math horizons are generally above HCAST (https://x.com/METR_Evals/status/1944817692294439179)
math doubling times are generally shorter than HCAST’s (https://x.com/METR_Evals/status/1944817692294439179)
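For concreteness, here is that extrapolation as a minimal sketch. The 20-minute 80% horizon, 9 months of progress, and 3-month doubling time are the assumptions stated above, not METR-published math-specific numbers.

```python
def horizon_after(start_minutes: float, months: float, doubling_months: float) -> float:
    """Time horizon after `months` of progress at a fixed doubling time."""
    return start_minutes * 2 ** (months / doubling_months)

o3_80pct_horizon_min = 20   # assumed 80%-success time horizon for released o3
months_of_progress = 9      # ~3 months since release + ~6 months lab-to-public lead
math_doubling_months = 3    # assumed faster doubling time for math vs. HCAST's ~4 months

projected = horizon_after(o3_80pct_horizon_min, months_of_progress, math_doubling_months)
print(f"Projected 80% horizon: {projected:.0f} min vs ~90 min per IMO problem")
# Projected 80% horizon: 160 min vs ~90 min per IMO problem
```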
The phrase “was predictable” sets off alarm bells for post facto wiggling.
If it’s predicted, I would expect you to say “was predicted”. If it wasn’t predicted due to somebody applying the model wrong, then I would expect you to say “should have been predicted”.
I feel like looking at unreleased models for doubling time mucks things up a bit. For instance I’m assuming the unreleased o3 model from December had a significantly longer time-horizon in math than the released o3, given its much higher benchmarks in FrontierMath, etc.
Can you be more specific about what you think the issue is?
I don’t think you can just start at the HCAST timeline for software engineering and map it to IMO problems.
An alternative, bearish prediction: Deep Think got 50% on USAMO as of May 20 (not released, lab frontier). Getting 80% requires tasks ~4x shorter than getting 50% (at least for software engineering; not sure what the ratio is for math), so we needed two doublings (6 months) to pull this off, and instead only had ~0.67 doublings (the ~2 months between May 20 and the IMO).
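A quick sketch of that arithmetic; the 4x ratio, 3-month doubling time, and ~2 elapsed months are the assumptions from this comment, not measured math-domain figures.

```python
import math

# Bearish calculation: doublings needed to lift the 80% horizon to today's 50% horizon,
# versus doublings actually available between May 20 and the IMO.
ratio_50_to_80 = 4        # 50%-horizon tasks ~4x longer than 80%-horizon tasks (SWE figure)
doubling_months = 3       # assumed math doubling time
months_elapsed = 2        # ~May 20 to mid-July

doublings_needed = math.log2(ratio_50_to_80)            # 2.0
doublings_available = months_elapsed / doubling_months  # ~0.67
print(doublings_needed, round(doublings_available, 2))  # 2.0 0.67
```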
Any other thresholds you think are comparably predictable? I’m skeptical about back-predictions like this.
I also forward-predicted it based specifically on METR research.
I think many thresholds in machine learning and mathematics can be analysed this way. The main barriers are (a) modeling hyperexponentiality as you get further out in time and (b) modeling things like RLVR hitting the compute ceiling, the new RL methods being hinted at by OA, etc.
Let’s take replacement of AI researchers as an example. AI research tasks in frontier labs rarely extend beyond ~200 working hours. Let’s ask for a 50% success rate. Let’s assume a 3-month doubling time to take hyperexponential progress into account. The current time horizon is 180 minutes; we need 12,000 minutes. That’s 6 doublings (180 → 360 → 720 → 1,440 → 2,880 → 5,760 → ~11,500 ≈ 12,000). 6 × 3 = 18 months = early 2027. Therefore, AI-researcher-level AIs by early 2027 seems not unlikely.
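The same extrapolation in code; the 180-minute current horizon, 200-hour target, and 3-month doubling time are the assumptions above.

```python
import math

# AI-researcher extrapolation: doublings from today's horizon to ~200 working hours.
current_horizon_min = 180        # assumed current 50% time horizon
target_horizon_min = 200 * 60    # ~200 working hours = 12,000 minutes
doubling_months = 3              # assumed, to fold in hyperexponential speed-up

doublings = math.log2(target_horizon_min / current_horizon_min)  # ~6.1
months = doublings * doubling_months                             # ~18 months
print(f"{doublings:.1f} doublings, ~{months:.0f} months")        # 6.1 doublings, ~18 months
```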
What are your predictions for OSWorld on Dec 31 of this year? Current SOTA is 45%. Of the 73 example tasks shown on the OSWorld data explorer, the 45th percentile task takes ~27 actions to complete, and we’ve got about two 3-month periods between now and EOY, so by a naive extrapolation we’d expect tasks up to about 100 steps to be solved by EOY. That’d be about 80%.
That sounds quite high to me—and to Manifold as well, it seems. Do you endorse that prediction, or is there additional nuance to your prediction method that I’m not taking into account?
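Spelling out the naive step-count extrapolation from the question; the 27-action figure and two remaining doublings are the questioner's assumptions.

```python
# Naive OSWorld extrapolation: double the solvable task length once per ~3-month period.
current_solvable_steps = 27   # ~45th-percentile task length, matching the 45% SOTA
doublings_by_eoy = 2          # two ~3-month periods left in the year
projected_steps = current_solvable_steps * 2 ** doublings_by_eoy
print(projected_steps)        # 108 -> roughly the ~80th-percentile task length, hence ~80%
```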
The most naive method would be to use the extrapolation based on the trend on OSWorld here: https://www.lesswrong.com/posts/6KcP7tEe5hgvHbrSF/metr-how-does-time-horizon-vary-across-domains. My guess is that this yields sane results.
The main delta is probably a slower doubling time.
OSWorld isn’t in machine learning or mathematics, so we don’t have much data to go on.
But what we do have suggests a ~4-month doubling time, from which we arrive at an ~8-minute 50% time horizon by EOY. Given:
> # Difficulty Split: Easy (<60s): 28.72%, Medium (60-180s): 40.11%, Hard (>180s): 30.17%
This does suggest greater than 80% by EOY, but this depends on model release cadence, etc.
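As a rough check on that reading: using the quoted difficulty split and an assumed ~8-minute (480 s) 50% horizon, and simply counting every sub-horizon task as solved (a simplification, not the commenter's stated model):

```python
# Share of OSWorld tasks under 180 s, from the quoted difficulty split.
split = {"easy_<60s": 0.2872, "medium_60_180s": 0.4011, "hard_>180s": 0.3017}
share_under_180s = split["easy_<60s"] + split["medium_60_180s"]
print(round(share_under_180s, 4))  # 0.6883 -> >80% also requires roughly a third of the hard bucket
```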