I don’t run the evaluations, but we probably will; no timeframe yet, though, since we’d need to do elicitation first. Claude’s SWE-bench Verified scores suggest it will be above 2 hours on the METR task set; the benchmarks are pretty similar apart from their different time annotations.
That’s a bit higher than I would have guessed. I compared the known data points that have both SWE-bench and METR medians (Sonnet 3.5, 3.6, and 3.7, o1, o3, o4-mini) and got an r^2 = 0.96 fit assuming a linear relationship between log(METR_median) and log(SWE-bench error).
That gives an estimate more like 110 minutes for a SWE-bench score of 72.7%, which works out to a Sonnet doubling time of ~3.3 months. (If I throw out o4-mini, the estimate is ~117 minutes, still below 120.)
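For concreteness, here's a minimal sketch of that fit in Python. The score and time-horizon arrays are illustrative placeholders (as are the baseline horizon and release gap in the doubling-time step), not the actual data points, so swap in the published numbers before reading anything into the output:

```python
import numpy as np

# Illustrative placeholders only -- substitute the published SWE-bench
# Verified scores and METR 50% time horizons for Sonnet 3.5/3.6/3.7,
# o1, o3, and o4-mini.
swe_score = np.array([0.49, 0.53, 0.62, 0.48, 0.69, 0.68])  # fraction solved
metr_min = np.array([30.0, 36.0, 54.0, 29.0, 82.0, 77.0])   # minutes

# Fit log(METR median) as a linear function of log(SWE-bench error rate).
x = np.log(1.0 - swe_score)
y = np.log(metr_min)
slope, intercept = np.polyfit(x, y, 1)

# Goodness of fit (r^2).
resid = y - (slope * x + intercept)
r2 = 1.0 - resid.var() / y.var()

# Predicted METR median for a 72.7% SWE-bench score.
pred = np.exp(slope * np.log(1.0 - 0.727) + intercept)

# Implied doubling time, taking a hypothetical baseline horizon (e.g.
# Sonnet 3.7's) and release gap; both numbers are assumptions here.
baseline_min, months_gap = 54.0, 3.2
doubling = months_gap * np.log(2) / np.log(pred / baseline_min)

print(f"r^2 = {r2:.3f}, predicted median ~ {pred:.0f} min, "
      f"doubling ~ {doubling:.1f} months")
```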
That would also imply an 85% SWE-bench score corresponds to something like a 6-6.5 hour METR median.
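A quick sanity check on that extrapolation, using hypothetical coefficients chosen to roughly match the numbers quoted above rather than the real fitted values:

```python
import math

# Hypothetical stand-ins for the fitted coefficients (log-minutes scale);
# with the actual fit, use the slope/intercept from the sketch above.
slope, intercept = -2.03, 2.06
pred_min = math.exp(intercept + slope * math.log(1.0 - 0.85))
print(f"85% SWE-bench -> ~{pred_min / 60:.1f} h METR median")  # ~6.2 h
```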