I computed METR time horizons for SWE-bench Verified SOTA models, using both the existing difficulty estimates and work-time estimates derived from commit data.
I used a range of different methods, including the original METR methodology where task-level success info was available.
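For the rankings where per-task results were available, the core of that computation is roughly the following (a minimal sketch, not my actual pipeline; the column names and the unregularized logistic regression are just assumptions for illustration):

```python
# Minimal sketch of a METR-style 50% time horizon for one model.
# Assumes a per-task table with columns "human_minutes" (work-time estimate)
# and "success" (0/1) -- both column names are placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def time_horizon_minutes(results: pd.DataFrame) -> float:
    """Fit P(success) ~ logistic(a + b * log2(human_minutes)) and return
    the task length at which predicted success crosses 50%."""
    X = np.log2(results["human_minutes"].to_numpy()).reshape(-1, 1)
    y = results["success"].to_numpy()
    clf = LogisticRegression(C=1e6).fit(X, y)  # large C ~ no regularization
    a, b = clf.intercept_[0], clf.coef_[0, 0]
    # 50% point: a + b * log2(t) = 0  =>  t = 2 ** (-a / b)
    return float(2 ** (-a / b))
```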
I did this for four different rankings: Epoch AI's, LLMStats's, and the "verified" and "bash only" rankings of the SWE-bench website.
In every single case the trend fits a logistic function (with an asymptote of a couple of hours) better than an exponential. In some cases the trend only becomes logistic with the last one or two datapoints, so it's not surprising that the METR report has an exponential fit for SWE-bench.
I am not sure when I'll get around to publishing this analysis, because it's a giant mess of different datasets and methods. But I thought I'd at least state the result before it becomes irrelevant, falsified, or obvious.
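For concreteness, the comparison I mean is just fitting both curve shapes to the time-horizon-vs-release-date points and comparing fit quality, e.g. via AIC. A minimal sketch with made-up numbers (not my actual data or fitting code):

```python
# Sketch of the exponential-vs-logistic comparison on the aggregate trend.
# The dates and horizons below are made-up placeholder numbers, not my data.
import numpy as np
from scipy.optimize import curve_fit

def exponential(t, h0, r):
    return h0 * np.exp(r * t)

def logistic(t, L, k, t0):
    return L / (1 + np.exp(-k * (t - t0)))

def aic(y, y_hat, n_params):
    # Compare on the log scale, since horizons span orders of magnitude.
    resid = np.log(y) - np.log(y_hat)
    n = len(y)
    return n * np.log(np.sum(resid**2) / n) + 2 * n_params

dates = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])      # years since first model
horizons = np.array([0.1, 0.3, 0.8, 1.5, 2.0, 2.2])   # time horizon in hours

p_exp, _ = curve_fit(exponential, dates, horizons, p0=[0.1, 1.0])
p_log, _ = curve_fit(logistic, dates, horizons, p0=[2.5, 2.0, 1.0], maxfev=10_000)

print("AIC exponential:", aic(horizons, exponential(dates, *p_exp), 2))
print("AIC logistic:   ", aic(horizons, logistic(dates, *p_log), 3))
```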
Wouldn’t you expect this if we’re close to saturating SWE-bench (and some of the tasks are impossible)? Like, you eventually cap out at the max performance for SWE-bench, and this doesn’t correspond to an infinite time horizon on literally SWE-bench (you’d need to include more, longer tasks).
SWE-bench Verified shouldn’t have that many impossible tasks, if any, right? And the highest scores for the rankings I used are still significantly below 80%. But it’s possible. Maybe that’s a good motivation to look at SWE-bench Pro.
I’d guess SWE-bench Verified has an error rate around 5% or 10%. They didn’t have humans baseline the tasks, just look at them and judge whether they seemed possible.
Wouldn’t you expect things to look logistic substantially before full saturation?
It depends on how the work times of these unsolvable tasks are distributed; you could in principle get any outcome. But there are a few ways to check for the existence of unsolvable tasks; maybe I’ll find the time today.
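The most basic such check would be something like flagging tasks that no evaluated model solves (a sketch with a hypothetical data structure; this is only one of the checks I have in mind):

```python
# One of the simpler checks: flag tasks that no evaluated model solves.
# `runs` maps model name -> set of solved task IDs; the structure is
# hypothetical, built from whatever per-task results are available.
def never_solved(runs: dict[str, set[str]], all_tasks: set[str]) -> set[str]:
    solved_by_any = set().union(*runs.values())
    return all_tasks - solved_by_any
```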
Hmm, actually, none of these checks can distinguish between tasks that are actually unsolvable and tasks that are only unsolvable for further scaled-up models of the current kind (with the framework and compute used in the evaluations).
I wouldn’t take one or two datapoints on a single benchmark too seriously, especially with a methodology as fiddly as time horizon and concerns like Ryan’s. Nevertheless, it seems like a good sign that you replicated the result using time estimates from commit data, as the original difficulty estimates seemed likely to be noisy. I’ll be interested to see if the trend continues, and whether the same is currently true of OSWorld. (Looks like they had a big update, so maybe it’s possible to get individual task data now.)
Yeah, I am also pretty much on the fence right now. But time will tell.