I don’t run the evaluations, but we probably will; no timeframe yet, though, since we’d need to do elicitation first. Claude’s SWE-bench Verified scores suggest it will be above 2 hours on the METR task set; the benchmarks are pretty similar apart from their different time annotations.
That’s a bit higher than I would have guessed. I compared the known data points that have both SWE-bench and METR medians (Sonnet 3.5, 3.6, and 3.7, o1, o3, o4-mini) and got an r^2 = 0.96 fit assuming a linear relationship between log(METR_median) and log(SWE-bench error).
That gives an estimate more like 110 minutes for a SWE-bench score of 72.7%, which works out to a Sonnet doubling time of ~3.3 months. (If I throw out o4-mini, the estimate is ~117 minutes, still below 120.)
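For concreteness, here's a minimal sketch of that fit in Python. The score and time-horizon arrays are illustrative placeholders (as are the baseline horizon and release gap in the doubling-time step), not the actual data points, so swap in the published numbers before reading anything into the output:

```python
import numpy as np

# Illustrative placeholders only -- substitute the published SWE-bench
# Verified scores and METR 50% time horizons for Sonnet 3.5/3.6/3.7,
# o1, o3, and o4-mini.
swe_score = np.array([0.49, 0.53, 0.62, 0.48, 0.69, 0.68])  # fraction solved
metr_min = np.array([30.0, 36.0, 54.0, 29.0, 82.0, 77.0])   # minutes

# Fit log(METR median) as a linear function of log(SWE-bench error rate).
x = np.log(1.0 - swe_score)
y = np.log(metr_min)
slope, intercept = np.polyfit(x, y, 1)

# Goodness of fit (r^2).
resid = y - (slope * x + intercept)
r2 = 1.0 - resid.var() / y.var()

# Predicted METR median for a 72.7% SWE-bench score.
pred = np.exp(slope * np.log(1.0 - 0.727) + intercept)

# Implied doubling time, taking a hypothetical baseline horizon (e.g.
# Sonnet 3.7's) and release gap; both numbers are assumptions here.
baseline_min, months_gap = 54.0, 3.2
doubling = months_gap * np.log(2) / np.log(pred / baseline_min)

print(f"r^2 = {r2:.3f}, predicted median ~ {pred:.0f} min, "
      f"doubling ~ {doubling:.1f} months")
```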
That would also imply an 85% SWE-bench score corresponds to something like a 6-6.5 hour METR median.
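A quick sanity check on that extrapolation, using hypothetical coefficients chosen to roughly match the numbers quoted above rather than the real fitted values:

```python
import math

# Hypothetical stand-ins for the fitted coefficients (log-minutes scale);
# with the actual fit, use the slope/intercept from the sketch above.
slope, intercept = -2.03, 2.06
pred_min = math.exp(intercept + slope * math.log(1.0 - 0.85))
print(f"85% SWE-bench -> ~{pred_min / 60:.1f} h METR median")  # ~6.2 h
```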