Aaron Staley comments on Cole Wyeth’s Shortform

Aaron Staley 24 May 2025 19:04 UTC
3 points
That’s a bit higher than I would have guessed. I compared the known data points that have both SWE-bench scores and METR medians (Sonnet 3.5, 3.6, 3.7, o1, o3, o4-mini) and fit a model assuming linearity between log(METR_median) and log(swe-bench-error), which gives r^2 = 0.96.
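For anyone who wants to redo this kind of fit, here is a rough sketch of a log-log least-squares regression. It assumes "swe-bench-error" means 1 minus the SWE-bench score. The function names are made up for illustration, and the data points below are synthetic placeholders generated from an exact power law, not the real model scores, so substitute the actual values before reading anything into the output.

```python
import math

def fit_loglog(error_rates, metr_minutes):
    """Ordinary least squares for log(metr) = a + b * log(error).

    Returns (a, b, r2). Assumes every input is strictly positive.
    """
    xs = [math.log(e) for e in error_rates]
    ys = [math.log(m) for m in metr_minutes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                  # slope in log-log space
    a = mean_y - b * mean_x        # intercept in log-log space
    r2 = (sxy * sxy) / (sxx * syy) # squared correlation for a 1-variable fit
    return a, b, r2

def predict_minutes(a, b, swe_bench_score):
    """Predicted METR median (minutes) for a SWE-bench score in [0, 1)."""
    error = 1.0 - swe_bench_score  # assumption: error = 1 - score
    return math.exp(a + b * math.log(error))

# Synthetic placeholder data (NOT real benchmark numbers):
# minutes = 10 * error^-2, so the fit should recover b = -2 exactly.
placeholder_errors = [0.6, 0.5, 0.4, 0.3]
placeholder_minutes = [10.0 * e ** -2.0 for e in placeholder_errors]

a, b, r2 = fit_loglog(placeholder_errors, placeholder_minutes)
print(f"slope={b:.3f} intercept={a:.3f} r2={r2:.3f}")
print(f"predicted minutes at 50% score: {predict_minutes(a, b, 0.5):.1f}")
```

With real data you would pass the six models' (1 - SWE-bench score) values and METR medians, then call `predict_minutes` at whatever score you want to extrapolate to.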

That gives an estimate of more like 110 minutes for a SWE-bench score of 72.7%, which works out to a Sonnet doubling time of ~3.3 months. (If I throw out o4-mini, the estimate is ~117 minutes, still below 120.)
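The doubling-time arithmetic is just elapsed time divided by the number of doublings. A minimal sketch, assuming exponential growth between two data points; the function name and the example inputs are hypothetical and are not the numbers behind the ~3.3 month figure:

```python
import math

def doubling_time_months(earlier_minutes, later_minutes, months_between):
    """Months per doubling, assuming exponential growth between two points."""
    doublings = math.log2(later_minutes / earlier_minutes)
    return months_between / doublings

# Hypothetical example: a 4x increase over 6 months is two doublings.
print(doubling_time_months(30.0, 120.0, 6.0))
```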

It would also imply that an 85% SWE-bench score corresponds to something like a 6-6.5 hour METR median.