This 25-minute 80% time horizon number seems like strong evidence against the superexponential model from AI 2027. On this graph the superexponential line shows 4h at the end of 2025. I expect GPT-5 to be the biggest model release of the year, and I don’t see how we would get a model with 8x GPT-5’s time horizon this year.
The SWE-bench scores are already well below the AI 2027 trend, which called for 85% by the end of this month; we’re at 75%. (And SOTA was ~64% when AI 2027 was released.)
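As a rough sanity check on the time-horizon gap (a sketch, not anything from the thread above: the doubling-time figures are assumptions loosely based on METR’s reported ~7-month historical doubling time and the faster recent trend):

```python
import math

# Assumed figures: GPT-5's 80% time horizon is ~25 minutes, and the
# superexponential line calls for ~4 hours at end of 2025 (both taken
# from the discussion above).
current_h = 25 / 60      # hours
target_h = 4.0           # hours

gap = target_h / current_h         # ~9.6x
doublings = math.log2(gap)         # ~3.26 doublings needed

# Doubling times below are assumptions: ~7 months is METR's reported
# long-run trend; ~4 months is the reported faster 2024-2025 trend.
for doubling_months in (7, 4):
    months = doublings * doubling_months
    print(f"{doubling_months}-month doubling: ~{months:.0f} months to reach 4h")
```

Even on the faster assumed trend, closing a ~9.6x gap takes roughly a year of doublings, versus the few months left in 2025.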
Gemini 3 should drop by the end of the month; we might hit that.
+25% on SWE-bench relative to Gemini 2.5? Quadrupling Gemini 2.5’s METR task length?
I suppose it’s a possibility, albeit a remote one.
It seems Gemini was ahead of OpenAI on the IMO gold. Its output was more polished, so presumably they achieved a gold-worthy model earlier. I therefore expect Gemini’s SWE-bench score to be at least ahead of OpenAI’s 75%.
I don’t believe there’s a strong correlation between mathematical ability and agentic coding tasks (as opposed to competition coding, where a stronger correlation exists).
Gemini 2.5 Pro was already well ahead of o3 on the IMO, but had worse SWE-bench/METR scores.
Claude is relatively bad at math but has hovered around SOTA on agentic coding.
I overall agree that things seem to be going slower than AI 2027 (and my median was longer when it came out).
However, as mentioned in the caption, the green curve is a simplified version of our original timelines model. Apologies for that; still, I think it’s reasonable to judge us based on it.
FWIW, though, the central superexponential Mar 2027 trajectory from our original model certainly is not strongly contradicted by GPT-5, whether or not an AI R&D speedup interpolation issue is fixed.
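For intuition, here is a toy sketch (mine, not the AI 2027 model itself) of what “superexponential” typically means in this context: each successive doubling of the time horizon takes a fixed fraction less calendar time than the previous one. All parameter values below are illustrative assumptions.

```python
def horizon_at(months, h0=0.4, first_doubling=6.0, shrink=0.9):
    """Time horizon in hours after `months`, where doubling k takes
    first_doubling * shrink**k months. Only valid for months strictly
    below the blowup point first_doubling / (1 - shrink)."""
    h, t, k = h0, 0.0, 0
    while True:
        step = first_doubling * shrink ** k  # duration of doubling k
        if t + step > months:
            # fractional progress through the current doubling
            return h * 2 ** ((months - t) / step)
        t, h, k = t + step, h * 2, k + 1

# Because the doubling times form a geometric series, the horizon
# diverges in finite time (here at 6 / (1 - 0.9) = 60 months), which is
# what lets such trajectories hit a superhuman-coder threshold on a
# fixed date rather than asymptotically.
print(horizon_at(0), horizon_at(12), horizon_at(24))
```

The fit question in the thread is essentially whether observed points like GPT-5’s sit close to a curve of this shape or to a plain exponential.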
The original model, filtered for superexponential (pre-AI-R&D-automation) trajectories that reach superhuman coder in 2027:
With AI R&D speedup bug fixed, also filtered for superexponential pre-AI-R&D-automation (backcast looks much better, GPT-5 prediction slightly worse):
Either way, we’re now working on a much-improved model, which will likely include an interactive web app. That should be an improvement over this static graph: for example, you’ll be able to try various parameter settings and see what time-horizon trajectories they generate and how consistent those are with future data points.
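A hypothetical sketch of the kind of consistency score such an app might compute: fit error, in log space, between a parameterized trajectory and observed (date, horizon) points. The function names and data points below are made-up illustrations, not real METR observations or anything from the actual model.

```python
import math

def exp_trajectory(months, h0, doubling_months):
    """Plain-exponential time horizon (hours) after `months`."""
    return h0 * 2 ** (months / doubling_months)

# (months since a reference date, observed 80% horizon in hours) --
# placeholder values for illustration only.
observations = [(0, 0.3), (7, 0.6), (14, 1.3)]

def log2_rmse(params, data):
    """RMS error, measured in doublings, between model and observations."""
    errs = [(math.log2(exp_trajectory(t, *params)) - math.log2(h)) ** 2
            for t, h in data]
    return (sum(errs) / len(errs)) ** 0.5

# An interactive app would let you slide (h0, doubling_months) -- or the
# extra parameters of a superexponential curve -- and watch this score
# update as new data points arrive.
print(log2_rmse((0.3, 7.0), observations))
```

Working in log space means the score weights a miss at 25 minutes the same as a proportionally equal miss at 4 hours, which matches how the doubling-time framing treats the data.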
Note also that the above trajectories are from the original model, not the May update model, which we unfortunately aren’t taking the time to recreate for various reasons. We think it would likely look a little worse in terms of the GPT-5 fit, though that might depend on how you filter for which trajectories count as superexponential.
Registering that I don’t expect GPT-5 to be “the biggest model release of the year,” for various reasons. I would guess (based on the cost and speed) that the model is GPT-4.1-sized. Conditional on this, the total training compute is likely to be below the state of the art.
How did you determine the cost and speed of it, given that there is no unified model that we have access to, just some router between models? Unless I’m just misunderstanding something about what GPT-5 even is.
The router is only on ChatGPT, not the API, I believe. And it switches between two models of the same size and cost (GPT-5 with thinking and GPT-5 without thinking).