SWE-bench Verified shouldn’t have that many impossible tasks, if any, right? And the highest scores for the rankings I used are still significantly below 80%. But it’s possible. Maybe that’s good motivation to look at SWE-bench Pro.
I’d guess SWE-bench Verified has an error rate of around 5% to 10%. They didn’t have humans baseline the tasks; they just looked at them and judged whether they seemed possible.
Wouldn’t you expect things to look logistic substantially before full saturation?
Side note: we find evidence of an error rate for SWE-bench Verified between 5% and 10% in our benchmark review.
https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate
I fitted logistic functions and Gaussian CDFs, each with a scaling factor, to the trend of the percentage scores for the four rankings I analysed, and they all asymptote below 80%. The idea was to find some evidence for an “irreducible error”.
But given that a 20+% error rate is clearly way too high, it still makes more sense to me to argue that improvement is slowing, and that these fits therefore asymptote too low, than to argue that the time horizons and percentages are asymptoting because of a high percentage of unsolvable tasks.
But this gave me a more general idea for assessing changes in improvement speed: the default assumption right now should be that model improvement moves linearly through log time-horizon space. Additionally, I found that at least SWE-bench Verified seems to have lognormally distributed task lengths, and I suspect that holds for many benchmarks.
This means that the path to saturation should follow a Gaussian CDF. The idea would be to use the movement through the first x percent of the benchmark to fit the Gaussian CDF (or at least sanity-check that assumption) and then see whether the model slows down for the rest of the benchmark. To put it differently: constant improvement speed → symmetric Gaussian underlying the CDF; slowdown → the right tail gets fatter.
Of course the signal would be pretty weak, but if one aggregated this over several benchmarks, it might make a good speedometer.
Conditional on a slowdown in AI progress, my primary hypothesis is that recent AI models haven’t scaled much in compute compared to past models and have instead relied on RL progress, and that current RL is becoming less and less of a free lunch and is actually less efficient than pre-training.
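The step from the two assumptions to a Gaussian CDF can be spelled out as follows. The symbols are introduced here for illustration, and the hard threshold (a model solves exactly those tasks no longer than its time horizon) is a simplifying assumption that is not in the original argument. Let $h(t)$ be the time horizon at date $t$, $L$ the length of a randomly drawn task, $S(t)$ the benchmark score, and $\Phi$ the standard normal CDF.

$$
\log h(t) = a + b\,t, \qquad \log L \sim \mathcal{N}(\mu, \sigma^2)
$$

$$
S(t) = \Pr\big(L \le h(t)\big) = \Pr\big(\log L \le a + b\,t\big) = \Phi\!\left(\frac{a + b\,t - \mu}{\sigma}\right)
$$

$$
\Phi^{-1}\big(S(t)\big) = \frac{a - \mu}{\sigma} + \frac{b}{\sigma}\,t
$$

So under constant speed $b$ the probit-transformed score is a straight line in $t$, and a slowdown shows up as the later points falling below that line.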
Which is a slight update against software-only singularity stories occurring.
It depends on how the work times of these unsolvable tasks are distributed; you could in principle get any outcome. But there are a few ways to check for the existence of unsolvable tasks. Maybe I’ll find the time today.
Hmm, actually none of these checks can distinguish between actually unsolvable tasks and tasks that are only unsolvable for further scaled-up models of the current kind (with the framework and compute used in the evaluations).