Aaron Staley comments on ryan_greenblatt’s Shortform

Aaron Staley 29 Jul 2025 16:25 UTC
3 points
My expectation is that GPT-5 will be a decent amount better than o3 on agentic software engineering (both in benchmarks and in practice), but won’t be substantially above trend. In particular, my median is that it will have a 2.75 hour time horizon^[1] on METR’s evaluation suite.^[2] This prediction was produced by extrapolating out the faster 2024-2025 agentic software engineering time horizon trend from o3 and expecting GPT-5 will be slightly below trend.^[3]
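
For readers who want to see the shape of the extrapolation described in the quote, here is a minimal sketch. It assumes the time horizon grows exponentially with a fixed doubling time, and it models "slightly below trend" as a multiplicative discount. The function name, the `trend_factor` parameter, and every input value are illustrative placeholders, not METR's measurements or the quoted author's actual inputs.

```python
def extrapolate_horizon(base_hours, doubling_months, months_elapsed, trend_factor=1.0):
    """Project a time horizon forward under an assumed exponential trend.

    base_hours:      horizon of the reference model (placeholder value below)
    doubling_months: assumed doubling time of the trend
    months_elapsed:  assumed gap between the reference model and the new one
    trend_factor:    1.0 means exactly on trend; below 1.0 means below trend
    """
    return base_hours * 2 ** (months_elapsed / doubling_months) * trend_factor


# Placeholder inputs chosen for round numbers, not taken from any dataset.
on_trend = extrapolate_horizon(base_hours=1.5, doubling_months=4.0, months_elapsed=4.0)
below_trend = extrapolate_horizon(1.5, 4.0, 4.0, trend_factor=0.9)

print(f"on trend: {on_trend:.2f} h, slightly below trend: {below_trend:.2f} h")
```

The point is only that the forecast is sensitive to two inputs, the assumed doubling time and the assumed release gap, so a different choice of either shifts the median noticeably.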

If the correlations continue to hold, this would map to something like a 78% to 80% range on SWE-bench pass@1 (which is likely to be announced at release). I’m personally not this bearish (I’d guess low 80s, given that the benchmark has reliably jumped ~3.5% per month), but we shall see.
Needless to say, if it scores 80%, we can say with high confidence that we are well behind the AI 2027 timeline predictions.
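
As a rough illustration of the "~3.5% monthly" reasoning, here is a sketch of a straight-line projection. It assumes the monthly gain is additive in percentage points and stays constant, which is a simplification. The baseline score and month count are hypothetical placeholders, not reported benchmark results.

```python
def project_score(base_score, monthly_gain, months_elapsed, ceiling=100.0):
    """Linearly project a benchmark score, capped at the benchmark's maximum."""
    return min(ceiling, base_score + monthly_gain * months_elapsed)


# Hypothetical inputs: a 70% baseline, three months of ~3.5 points per month.
projected = project_score(base_score=70.0, monthly_gain=3.5, months_elapsed=3)
print(f"projected pass@1: {projected:.1f}%")
```

A linear trend like this cannot hold indefinitely near the ceiling, which is why the sketch caps the result at 100.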