Wouldn’t you expect this if we’re close to saturating SWE-bench (and some of the tasks are impossible)? Like, you eventually cap out at the max achievable performance on SWE-bench, and hitting that cap doesn’t correspond to an infinite time horizon on literally SWE-bench (you’d need to include more, longer tasks).
SWE-bench Verified shouldn’t have that many impossible tasks, if any, right? And the highest scores in the rankings I used are still significantly below 80%. But it’s possible. Maybe a good motivation to look at SWE-bench Pro.
I’d guess SWE-bench Verified has an error rate of around 5% to 10%. They didn’t have humans baseline the tasks; they just looked at them to see whether they seemed possible.
Wouldn’t you expect things to look logistic substantially before full saturation?
It depends on how the work times of these unsolvable tasks are distributed; in principle you could get any outcome. But there are a few ways to check for the existence of unsolvable tasks. Maybe I’ll find the time today.
Hmm, actually, none of these checks can distinguish between tasks that are actually unsolvable and tasks that are merely unsolvable for further scaled-up models of the current kind (with the framework and compute used in the evaluations).
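To make the saturation point concrete, here is a rough toy sketch of how a fixed share of unsolvable tasks caps the benchmark score no matter how large the time horizon gets. Everything in it is an assumption for illustration: the logistic-in-log-task-length success model, the task lengths, the 10% unsolvable share (picked from the 5% to 10% guess above), and the names `success_prob` and `benchmark_score`. None of it is fitted to real SWE-bench data.

```python
import math

def success_prob(task_minutes, horizon_minutes, slope=1.0):
    """Toy model: success is logistic in log2(task length), 50% at the horizon."""
    x = slope * (math.log2(horizon_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))

def benchmark_score(task_minutes, solvable, horizon_minutes):
    """Expected fraction of tasks solved; unsolvable tasks always count as failures."""
    total = 0.0
    for minutes, ok in zip(task_minutes, solvable):
        if ok:
            total += success_prob(minutes, horizon_minutes)
    return total / len(task_minutes)

# Hypothetical benchmark: 100 tasks with lengths from 1 to 256 minutes.
task_minutes = [2 ** (i % 9) for i in range(100)]
# Assumption: every tenth task is broken, spread across all task lengths.
solvable = [i % 10 != 0 for i in range(100)]

# Highest score any model could reach on this toy benchmark.
ceiling = sum(solvable) / len(solvable)

horizons = (10, 100, 1000, 10**6, 10**9)
scores = {h: benchmark_score(task_minutes, solvable, h) for h in horizons}

for h in horizons:
    print(f"horizon {h:>10} min -> score {scores[h]:.4f} (ceiling {ceiling:.2f})")
```

In this toy setup the score creeps up toward the ceiling as the horizon grows but never passes it, so reading a horizon off the raw score blows up near the ceiling. Concentrating the unsolvable tasks among the long tasks instead of spreading them evenly would change the shape of the curve, which is the "depends on how the work times are distributed" caveat.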