Should we update against seeing relatively fast AI progress in 2025 and 2026? (Maybe (re)assess this after the GPT-5 release.)
Around the early o3 announcement (and maybe somewhat before that?), I felt like there were some reasonably compelling arguments for putting a decent amount of weight on relatively fast AI progress in 2025 (and maybe in 2026):
Maybe AI companies will be able to rapidly scale up RL further because RL compute is still pretty low (so there is a bunch of overhang here); this could cause fast progress if companies can effectively do RL directly on useful tasks, or if RL transfers well even from more arbitrary tasks (e.g., competition programming).
Maybe OpenAI hasn’t really tried hard to scale up RL on agentic software engineering and has instead focused on scaling up single turn RL. So, when people (either OpenAI themselves or other people like Anthropic) scale up RL on agentic software engineering, we might see rapid progress.
It seems plausible that larger pretraining runs are still pretty helpful, but prior runs have gone wrong for somewhat random reasons. So, maybe we’ll see some more successful large pretraining runs (with new improved algorithms) in 2025.
I updated against this perspective somewhat because:
The releases of 3.7 Sonnet and 4 Opus were somewhat below expectations on this perspective. It looks like there wasn’t some easy way to just actually do a bunch of RL on agentic software engineering (with reasoning?) in a way that makes a massive difference (and wasn’t already in the process of being scaled up). Or, at least Anthropic wasn’t able to pull this off; it seems plausible that Anthropic is substantially worse at RL than OpenAI (at least at some aspects of RL like effectively scaling up RL on more narrow tasks). Interestingly, reasoning doesn’t seem to help Anthropic models on agentic software engineering tasks, but does help OpenAI models.
We haven’t yet seen much better models due to more (or algorithmically improved) pretraining AFAICT.
We haven’t seen OpenAI releases that perform substantially better than o3 at software engineering yet despite o3 being announced 7 months ago. (That said, o3 was actually released only 3 months ago.)
I updated towards thinking that the training of o3 was more focused on software engineering than I previously thought (at least the final release version of o3) and the returns weren’t that big. (This is due to rumors, seeing that OpenAI was training on software engineering tasks here, and based on OpenAI releases and communication like Codex.)
I updated a bit against this perspective due to xAI seemingly scaling things up a bunch, but I don’t put as much weight on this because it seems pretty plausible they just did a bad job scaling things up. (E.g., maybe they didn’t actually scale up RL to pretraining scale or if they did, maybe this RL was mostly compute inefficient RL on lower quality environments. xAI might also just generally be algorithmically behind.)
GPT-5 is expected to be released in 0.5-3 weeks and rumors indicate that it is substantially more focused on practical (agentic) software engineering. This is (arguably) the first major model release from OpenAI since o3, and it should resolve some of our uncertainties (particularly related to whether there was/is a bunch of low hanging fruit at OpenAI due to them not being very focused on software engineering).
My expectation is that GPT-5 will be a decent amount better than o3 on agentic software engineering (both in benchmarks and in practice), but won’t be substantially above trend. In particular, my median is that it will have a 2.75 hour time horizon[1] on METR’s evaluation suite[2]. This prediction was produced by extrapolating out the faster 2024-2025 agentic software engineering time horizon trend from o3 and expecting GPT-5 will be slightly below trend.[3]
If GPT-5 is actually a large (way above trend) jump in agentic software engineering with (e.g.) a >6 hour time horizon[4] (which seems plausible but unlikely to me), then we’ll have seen relatively fast (and possibly very fast) software progress in 2025 and we’d naively expect this to continue.[5] If GPT-5 is below trend[6], then it seems like the case against expecting relatively faster AI progress in 2025/2026 due to scaling up RL focused on agentic software engineering is pretty strong.
Overall, I wonder if I have (thus far) insufficiently updated my overall timelines picture based on the observations we’ve had so far in 2025. I’m a bit worried that I’m still operating on cached beliefs when these observations should have pushed away a bunch of the shorter timelines mass. Regardless, I think that the release of GPT-5 (or really, 2-6 weeks after the release of GPT-5 so that we have a better picture of GPT-5’s capabilities) will be a good point to (re)assess and consider stronger updates.
Edit: An earlier version of this post said “3.5 hours”, but this was actually a mistake because I thought o3 had a 2 hour time horizon when it actually has a 1.5 hour time horizon. I also edited from “>8” to “>6” at a later point in this post, as “>8 hours” was meant to refer to 2 doublings from o3, which is actually “>6 hours”.
I do worry that METR’s evaluation suite will start being less meaningful and noisier for longer time horizons as the evaluation suite was built a while ago. We could instead look at 80% reliability time horizons if we have concerns about the harder/longer tasks.
The faster 2024-2025 agentic software engineering time horizon (see figure 19 in METR’s paper) has a 4 month doubling time. o3 was released 4 months before GPT-5 is expected to be released and o3 has a 1.5 hour time horizon (edit: this used to say 2 hours, which was a mistake), so this yields a 3 hour time horizon for GPT-5. I think that GPT-5 is more likely than not to be below trend (on at least METR’s specific evaluation) so I round this down a bit to 2.75 hours, though I have a pretty wide confidence interval. I expect below trend rather than above trend due to some early reports about GPT-5, the trend being pretty fast, Opus 4 having lower than expected results, and thinking that the METR evaluation suite might have issues with larger time horizons that result in misleadingly lower numbers.
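To make this extrapolation concrete, here is a minimal sketch of the arithmetic (the 1.5 hour o3 horizon, 4 month doubling time, and ~4 month release gap are the figures stated above; treating the trend as a clean exponential is of course a simplification):

```python
import math

# Assumed inputs from the reasoning above.
o3_horizon_hours = 1.5          # o3's 50% time horizon on METR's suite
doubling_time_months = 4.0      # faster 2024-2025 agentic SWE trend
gap_months = 4.0                # approx. time between o3 and GPT-5 releases

# Pure trend extrapolation: the horizon doubles every `doubling_time_months`.
trend_horizon = o3_horizon_hours * 2 ** (gap_months / doubling_time_months)
print(f"On-trend GPT-5 horizon: {trend_horizon:.2f} hours")  # 3.00 hours

# Shading the estimate slightly below trend, as in the footnote.
below_trend_horizon = 2.75
print(f"Median prediction used here: {below_trend_horizon} hours")
```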
Again, I’d want to look at multiple metrics. I’m referring to seeing agentic software engineering performance that looks analogous to a >6 hour time horizon on METR’s evaluation suite when aggregating over multiple relevant metrics.
It seems more likely to be a massive jump if OpenAI actually wasn’t yet very focused on agentic software engineering when training o3, but is more focused on this now. This article claims that something like this is the case.
It’s harder to confidently notice that GPT-5 is below trend than to tell that it’s way above trend. We should expect it to be some amount better than o3, and the difference between a 2 hour and a 3 hour time horizon is legitimately hard to measure.
I basically agree with this whole post. I used to think there were double-digit % chances of AGI in each of 2024, 2025, and 2026, but now I’m more optimistic: it seems like “Just redirect existing resources and effort to scale up RL on agentic SWE” is now unlikely to be sufficient (whereas in the past we didn’t have trends to extrapolate, and we had some scary big jumps like o3 to digest).
I still think there’s some juice left in that hypothesis though. Consider how in 2020, one might have thought “Now they’ll just fine-tune these models to be chatbots and it’ll become a mass consumer product” and then in mid-2022 various smart people I know were like “huh, that hasn’t happened yet, maybe LLMs are hitting a wall after all” but it turns out it just took till late 2022/early 2023 for the kinks to be worked out enough.
Also, we should have some credence on new breakthroughs, e.g. neuralese, online learning, whatever. Maybe like 8%/yr for a breakthrough that would lead to superhuman coders within a year or two, after being appropriately scaled up and tinkered with.
Re neuralese, online/continual learning, or long-term memory that isn’t just a bigger context window: I’m much more skeptical of such breakthroughs being easy to integrate on short timelines, because they would likely require architectural changes that aren’t easy to make quickly.
The potential for breakthroughs, combined with Moore’s law continuing to make lots of compute cheap for researchers, is a reason my median timelines aren’t in the latter half of the century. But I think it’s much more implausible to get this working very soon, so I’m much closer to 0.3% a year for 2025-2027.
@Mo Putera @the gears to ascension take the “Moore’s law will continue” point as a prediction that new paradigms like memristors will launch new S-curves of efficiency until we reach the Landauer Limit, which is 6.5 OOMs away, and that the current paradigm has 200x more efficiency savings to go (see the quick arithmetic sketch after the link):
https://www.forethought.org/research/how-far-can-ai-progress-before-hitting-effective-physical-limits#chip-technology-progress
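As a quick sanity check on the scale of those numbers, here is a minimal sketch: the Landauer bound itself is standard physics, while the 6.5 OOM figure is taken from the linked post rather than derived here.

```python
import math

k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # room temperature, K

# Landauer limit: minimum energy to erase one bit of information at temperature T.
landauer_j_per_bit = k_B * T * math.log(2)
print(f"Landauer limit at 300 K: {landauer_j_per_bit:.2e} J/bit")  # ~2.9e-21 J

# Taking the linked post's claim at face value: current hardware sits roughly
# 6.5 orders of magnitude above this floor (that figure comes from the source,
# not from an independent derivation here).
ooms_remaining = 6.5
implied_current_j_per_bit = landauer_j_per_bit * 10 ** ooms_remaining
print(f"Implied current energy per bit erasure: {implied_current_j_per_bit:.1e} J")
```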
Re “reasoning doesn’t seem to help Anthropic models on agentic software engineering tasks”: I use ‘ultrathink’ in Claude Code all the time and find that it makes a difference.
I’m overall skeptical of overinterpreting/extrapolating the METR numbers. The methodology is far too anchored on the capabilities of a single AI model, a lightweight scaffold, and a notion of ‘autonomous’ task completion measured in ‘human-hours’. I think this is a mental model of capabilities progress that will lead to erroneous predictions.
If you are trying to capture the absolute frontier of what is possible, you don’t only test a single model acting alone in an empty codebase with limited internet access and scaffolding. I would personally be significantly less capable at agentic coding if I only used one model with limited access to resources (for reference: replicating subliminal learning took about 1 hour of work plus 2 hours of waiting for fine-tunes on the day of the release). You instead use a variety of AI models based on their pros and cons[1], with well-crafted codebases for agentic coding, and give them access to whatever they want on the internet as a reference (+ much more)[2]. METR does note this limitation, but I want to emphasize its importance and the potential for misleading extrapolations if people only consider the headline charts without considering the nuance.
Anthropic suggests multi-agent scaffolds are much better for research.
We note some of what that might look like here.
I think the non-formal IMO gold was unexpected, and we heard explicitly that it won’t be in GPT-5. So I would wait to see how it pans out. It may not matter in 2025, but I think it could in 2026.
Why should we think that the relevant progress driving non-formal IMO is very important for plausibly important capabilities like agentic software engineering? I’d guess the transfer is relatively weak unless the IMO results were driven by general purpose advances. This seems somewhat unlikely: if the main breakthrough was in better performance on non-trivial-to-verify tasks (as various posts from OpenAI people claim), then even if this generalizes well beyond proofs this wouldn’t obviously particularly help with agentic software engineering (where the core blocker doesn’t appear to be verification difficulty).
Edit: I think I mostly retract this comment, see below.
I’m surprised by this. To me it seems hugely important how fast AIs are improving on tasks with poor feedback loops, because obviously they’re in a much better position to improve on easy-to-verify tasks, so “tasks with poor feedback loops” seem pretty likely to be the bottleneck to an intelligence explosion.
So I definitely do think that “better performance on non-trivial-to-verify tasks” is very important for some “plausibly important capabilities”, including agentic software engineering. (This also seems related to why the AIs are much better at benchmarks than at helping people out with their day-to-day work.)
Hmm, yeah I think you’re right, though I also don’t think I articulated what I was trying to say very well.
Like I think my view is:
There was some story where we would see very fast progress on relatively easy-to-verify (or trivial-to-verify) tasks, and I’m talking about that. It seems like agentic software engineering could reach very high levels without necessarily needing serious improvements on harder-to-verify tasks.
Faster progress on non-trivial-to-verify tasks might not be the limiting factor if progress on easy-to-verify tasks isn’t that fast.
I still think that there won’t be a noticeable jump as the IMO methods make it into production models, but this is due to more general heuristics (and the methods may still matter, it just won’t be something to wait for, I think).
I think IMO results were driven by general purpose advances, but I agree I can’t conclusively prove it because we don’t know details. Hopefully we will learn more as time goes by.
An informal argument: I think currently agentic software engineering is blocked on context rot, among other things. I expect IMO systems to have improved on this, since IMO time control is 1.5 hours per problem.
(I’m skeptical that much of the IMO improvement was due to improving how well AIs can use their context in general. This isn’t a crux for my view, but it also seems pretty likely that the AIs didn’t do more than ~100k serial tokens of reasoning for the IMO while still aggregating over many such reasoning traces.)
I wrote an update here.
Now that GPT-5 is released and we have details about Grok’s failure, we can start the re-assessment.
GPT-5 reached 2h17m, which seems like excellent news. However, excluding spurious failures would bring GPT-5’s performance to 2h41m, which aligns with Greenblatt’s prediction. Moreover, METR’s evaluators themselves think that “GPT-5 could have benefitted from a larger token budget”, implying that the benchmark is starting to degrade. What other relevant metrics exist?
The AI-2027 forecast has mid-2025 agents reach 85% on SWE-bench Verified and 65% on the OSWorld benchmark.
OSWorld reached 60% on August 4 if we use no filters. On SWE-bench with a minimal agent, Claude Opus 4 (20250514) reached 67.6% when evaluated in August. Moreover, as of August 7 the only models that SWE-bench had evaluated after July 1st were Claude 4 Opus and two Chinese models. In June, SWE-bench Verified reached 75% with TRAE, and TRAE now claims to use Grok 4 and Kimi K2.
Grok 4 managed to fail on tasks that take humans 2-4 seconds(!!) or 2-4 minutes, and had a fiasco on 2-4 hour long tasks. Page 22 of the METR paper could imply that the dataset contains few tasks that are 2-4 hours long. If tasks taking 2-4 seconds, minutes, or hours “sandbagged” Grok’s 80% time horizon down to 15 minutes, then the metric underestimates Grok’s true capabilities.
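A minimal sketch of why scattered failures compress the 80% horizon so much, assuming METR’s horizons come from a logistic fit of success probability against log2 of human task length (the 50%/80% figures for Grok 4 and o3 are the ones quoted later in this thread):

```python
import math

def implied_slope(t50_min, t80_min):
    # Under a logistic model  logit(P(success)) = a - b * log2(task_minutes),
    # the 50% and 80% horizons pin down the slope b via
    #   log2(t50 / t80) = ln(4) / b.
    return math.log(4) / math.log2(t50_min / t80_min)

# Horizons as quoted in this thread (minutes).
grok4_slope = implied_slope(t50_min=110, t80_min=15)
o3_slope    = implied_slope(t50_min=92,  t80_min=20)
print(f"Implied slope: Grok 4 ~{grok4_slope:.2f}, o3 ~{o3_slope:.2f}")
# Grok 4's shallower slope (~0.48 vs ~0.63) is what you'd expect if its
# failures are scattered across short tasks rather than concentrated at the
# long end; a shallow curve pulls the 80% horizon far below the 50% one.
```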
While there are no estimates for Gemini 2.5 Deep Think, which was released on August 1, IIRC a LessWronger claimed that the public version received a bronze medal on IMO 2025. Another LessWronger claimed that “Gemini was ahead of openai on the IMO gold. The output was more polished so presumably they achieved a gold worthy model earlier. I expect gemini’s swe bench to thus at least be ahead of OpenAI’s 75%.”
To conclude, I doubt that we still have benchmarks that can be relied upon to quickly estimate models’ capabilities: SWE-bench and OSWorld are likely too slow, and METR’s suite has begun to fill with noise. We do still have ARC-AGI, but Grok’s success there may just demonstrate that it can be gamed. And that’s ignoring Claude’s potential improvements after Opus 4.1...
EDIT: TRAE uses an unknown scaffold. However, applying mini-SWE-agent to Claude 4 Opus (20250514) yields better results than GPT-5, implying that other benchmarks might also increase after the Claude Opus 4 update to 4.1 and future updates.
If the correlations continue to hold, the predicted ~2.75 hour time horizon would map to something like a 78% to 80% range on SWE-bench pass@1 (which is likely to be announced at release). I’m personally not this bearish (I’d guess low 80s given that the benchmark has reliably jumped ~3.5% monthly), but we shall see.
Needless to say, if it scores 80%, we are well below the AI 2027 timeline predictions with high confidence.
Isn’t the SWE-Bench figure and doubling time estimate from the blogpost even more relevant here than fig. 19 from the METR paper?
The data is pretty low-quality for that graph because the agents we used were inconsistent and Claude 3-level models could barely solve any tasks. Epoch has better data for SWE-bench Verified, which I converted to time horizon here and found to also be doubling roughly every 4 months. Their elicitation is probably not as good for OpenAI models as for Anthropic models, but both are increasing at similar rates.
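For what it’s worth, the doubling-time estimate itself is just a linear fit to log2 of the time horizon; here is a minimal sketch with made-up placeholder points (not Epoch’s or METR’s actual data):

```python
import numpy as np

# Hypothetical (months since an arbitrary start, 50% time horizon in minutes)
# points -- illustrative placeholders only.
months  = np.array([0, 4, 8, 12, 16])
horizon = np.array([10, 22, 38, 85, 160.0])

# Fit log2(horizon) = intercept + slope * months; doubling time = 1 / slope.
slope, intercept = np.polyfit(months, np.log2(horizon), 1)
print(f"Estimated doubling time: {1 / slope:.1f} months")  # ~4 months here
```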
no. gpt5 is the cheap, extremely good writing model, imo. much better writer out there rn than any other model
eval to pay attention to:
I think that, even before the release of GPT-5 and setting aside Grok 4’s problems, I have a weak case against non-neuralese AI progress being likely to be fast. Recall the METR measurements.
The time horizon of base LLMs experienced a slowdown or plateau[1] between GPT-4 (5 minutes, Mar ’23) and GPT-4o (9 min, May ’24).
Evaluation of Chinese models shows DeepSeek’s time horizons[2] changing only from 18 to 31 minutes between[3] V3 (Dec ’24) and R1-0528 (May ’25).
While Grok 4 was likely trained incompetently[4] and/or for the benchmarks, its 50% time horizon is 1.83 hrs (vs. o3’s 1.54 hrs) and its 80% time horizon is 15 min (vs. o3’s 20 min). In other words, Grok 4’s performance is comparable with that of o3.
Taken together, the two plateaus and Grok 4’s failure (see the rough doubling-time check below) suggest a troubling pattern: creating an AGI is likely to require[5] neuralese, which would likely prevent humans from noticing misalignment.
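As a rough check on the plateau claim, here are the doubling times implied by the two data points above, assuming exponential growth between the paired measurements:

```python
import math

def doubling_time_months(h0_min, h1_min, gap_months):
    # Implied doubling time if the time horizon grew exponentially
    # between the two measurements.
    return gap_months * math.log(2) / math.log(h1_min / h0_min)

# Numbers from the bullets above.
print(f"GPT-4 -> GPT-4o:        {doubling_time_months(5, 9, 14):.1f} months")   # ~16.5
print(f"DeepSeek V3 -> R1-0528: {doubling_time_months(18, 31, 5):.1f} months")  # ~6.4
# Both come out slower than the ~4-month doubling time of the fast 2024-2025
# agentic SWE trend discussed earlier in this thread.
```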
While GPT-4.5 has a time horizon between 30 and 40 mins, it, unlike GPT-4o, was a MoE and was trained on CoTs.
Alas, METR’s evaluation of DeepSeek’s capabilities might have missed “agent scaffolds which could elicit the capabilities of the evaluated models much more effectively”. If there exists an alternate scaffold where R1-0528 becomes a capable agent and V3 doesn’t, then DeepSeek’s models are not on a plateau.
In addition, DeepSeek V3, released in December, didn’t use a CoT. If the main ingredient necessary for a capabilities increase is MoE rather than CoT, then what can be said about Kimi K2?
Grok 4 could have also been deliberately trained on complex tasks, which might have made the success rate less time-dependent. After all, it did reach 16% on the ARC-AGI-2 benchmark.
There is, however, Knight Lee’s proposal for the creation of many agents that have access to each other’s CoTs and work in parallel. While Grok 4 Heavy could be a step in this direction, its agents only receive access to each other’s CoTs after they finish their work.
Which reports, specifically?