That’s a bit higher than I would have guessed. I compared the known data points that have both SWE-bench scores and METR medians (Sonnet 3.5, 3.6, 3.7, o1, o3, o4-mini) and got an r^2 = 0.96 fit, assuming a linear relationship between log(METR_median) and log(swe-bench-error).
That gives an estimate more like 110 minutes for a SWE-bench score of 72.7%, which works out to a Sonnet doubling time of ~3.3 months. (If I throw out o4-mini, the estimate is ~117 minutes, still below 120.)
It would also imply that an 85% SWE-bench score corresponds to something like a 6-6.5 hour METR median.
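
For anyone who wants to reproduce this kind of fit, here is a minimal sketch in Python. It assumes that swe-bench-error means 1 minus the SWE-bench score as a fraction, and that the fit is ordinary least squares in log-log space. The function names are made up for illustration, and the demo data is synthetic (generated from an exact power law), not the actual model measurements, so substitute the real points before reading anything into the numbers. The last helper only back-solves the power-law exponent implied by two (score, minutes) pairs such as the ones quoted above.

```python
import math

def fit_loglog(errors, medians):
    """Ordinary least squares of log(median) on log(error).

    Returns (slope, intercept, r_squared) for the model
    log(median) = intercept + slope * log(error).
    """
    xs = [math.log(e) for e in errors]
    ys = [math.log(m) for m in medians]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = (sxy * sxy) / (sxx * syy)  # squared correlation, valid for a one-variable fit
    return slope, intercept, r_squared

def predict_median(error, slope, intercept):
    """Predicted median (same units as the fitted medians) at a given error rate."""
    return math.exp(intercept + slope * math.log(error))

def implied_exponent(score_a, minutes_a, score_b, minutes_b):
    """Power-law exponent implied by two (score, minutes) points.

    Assumes error = 1 - score, with scores given as fractions.
    """
    err_a = 1.0 - score_a
    err_b = 1.0 - score_b
    return math.log(minutes_b / minutes_a) / math.log(err_b / err_a)

# SYNTHETIC placeholder data, NOT the real SWE-bench / METR numbers:
# error rates paired with medians from an exact power law, median = 3 * error**-2.
demo_errors = [0.60, 0.50, 0.40, 0.30]
demo_medians = [3.0 * e ** -2.0 for e in demo_errors]

slope, intercept, r2 = fit_loglog(demo_errors, demo_medians)
print("slope:", slope, "r^2:", r2)
print("predicted median at 27.3% error:", predict_median(0.273, slope, intercept))

# Exponent implied by the two estimates quoted above (110 min at 72.7%, 6 h at 85%).
print("implied exponent:", implied_exponent(0.727, 110, 0.85, 360))
```

Plugging the two quoted estimates into `implied_exponent` gives an exponent of roughly -2 (using 6 hours) to -2.1 (using 6.5 hours), i.e. under this fit the METR median scales roughly with the inverse square of the SWE-bench error rate.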