I fitted logistic functions and Gaussian CDFs scaled by a free asymptote factor to the trend of percentage scores for the four rankings I analysed, and they all asymptote below 80%. The idea was to find some evidence of an "irreducible error".
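For concreteness, here is a minimal sketch of what such a fit looks like (not the code I actually used): a logistic curve and a scaled Gaussian CDF, each with a free ceiling parameter, fitted with scipy's curve_fit. The date/score arrays are invented purely for illustration.

```python
# Minimal sketch of the curve fitting, not my actual code. Both fits include a
# free "ceiling" parameter, which is the asymptote read off afterwards.
# The date/score arrays are invented purely for illustration.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

t = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])             # years since 2020 (hypothetical)
score = np.array([0.05, 0.12, 0.22, 0.35, 0.48, 0.58, 0.65])  # fraction of tasks solved

def logistic(t, ceiling, k, t0):
    return ceiling / (1.0 + np.exp(-k * (t - t0)))

def scaled_gauss_cdf(t, ceiling, mu, sigma):
    return ceiling * norm.cdf(t, loc=mu, scale=sigma)

for name, f, p0 in [("logistic", logistic, (0.8, 1.0, 4.0)),
                    ("scaled Gaussian CDF", scaled_gauss_cdf, (0.8, 4.0, 1.0))]:
    params, _ = curve_fit(f, t, score, p0=p0, maxfev=10_000)
    print(f"{name}: fitted asymptote = {params[0]:.2f}")
```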
But given that a 20+% error rate is clearly far too high, it still makes more sense to me to argue that improvement is slowing and that these fits therefore asymptote too low, than to argue that time horizons and percentages are plateauing because a large share of the tasks is unsolvable.
But this gave me a more general idea for assessing changes in improvement speed. The default assumption right now should be that model improvement moves linearly through log time-horizon space. Additionally, I found that at least SWE-bench Verified seems to have task lengths that are lognormally distributed, and I suspect that holds for many benchmarks.
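A rough version of that lognormality check, assuming you have a per-task human time estimate. The `task_minutes` values are a hypothetical stand-in (SWE-bench Verified itself only gives coarse difficulty buckets rather than exact minutes, as far as I know).

```python
# Rough lognormality check: if task lengths are lognormal, their logs should be
# roughly Gaussian. 'task_minutes' is a hypothetical stand-in for per-task human
# time estimates.
import numpy as np
from scipy import stats

task_minutes = np.array([3, 8, 15, 20, 45, 60, 90, 120, 240, 480])  # invented values

# Test the logs for normality...
w_stat, p_value = stats.shapiro(np.log(task_minutes))
print(f"Shapiro-Wilk on log(task length): W={w_stat:.3f}, p={p_value:.3f}")

# ...and, equivalently, fit a lognormal directly and compare with a KS test.
shape, loc, scale = stats.lognorm.fit(task_minutes, floc=0)
ks = stats.kstest(task_minutes, "lognorm", args=(shape, loc, scale))
print(f"KS against fitted lognormal: D={ks.statistic:.3f}, p={ks.pvalue:.3f}")
```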
This means that the path to saturation should follow a Gaussian CDF. The idea would then be to use the movement through the first x percent of the benchmark to fit the Gaussian CDF (or at least sanity-check that assumption), and then see whether the model slows down over the rest of the benchmark. Put differently: constant improvement speed → symmetric underlying Gaussian behind the CDF; slowdown → the right tail gets fatter.
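A sketch of that check under the assumptions above: fit a Gaussian CDF to the early portion of a score-vs-time series, extrapolate it forward, and see whether later scores systematically undershoot it. All numbers here are hypothetical.

```python
# Sketch of the "speedometer": fit a Gaussian CDF to the early part of a
# score-vs-time series, extrapolate, and check whether later scores fall
# systematically below it (fatter right tail = slowdown). Data is hypothetical.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

t = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])              # years (hypothetical)
score = np.array([0.05, 0.12, 0.22, 0.35, 0.46, 0.55, 0.62, 0.67])  # fraction solved

def gauss_cdf(t, mu, sigma):
    return norm.cdf(t, loc=mu, scale=sigma)

# Fit only on the early regime, here defined as scores below 50%.
early = score < 0.5
(mu, sigma), _ = curve_fit(gauss_cdf, t[early], score[early], p0=(4.0, 1.0))

# Extrapolate to the later points and look at the sign of the residuals.
residuals = score[~early] - gauss_cdf(t[~early], mu, sigma)
print("mean late residual:", residuals.mean())
# Consistently negative residuals: saturation arrives more slowly than the
# symmetric CDF predicts (right tail fatter), i.e. improvement is slowing.
# Residuals near zero: constant speed through log time-horizon space still fits.
```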
Of course the signal would be pretty weak, but aggregated over several benchmarks it might make a good speedometer.
Conditional on a slowdown in AI progress, my primary hypothesis is that recent AI models haven't scaled compute much compared to past models and have instead relied on RL progress, and that current RL is becoming less of a free lunch than it used to be and is actually less efficient than pre-training.
Which is a slight update against software-only singularity stories occurring.
Side note: we find evidence of an error rate for SWE-bench Verified of between 5 and 10% in our benchmark review.
https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate