I’d guess SWE-bench Verified has an error rate around 5–10%. They didn’t have humans baseline the tasks, just look at them and judge whether they seem possible.
Wouldn’t you expect things to look logistic substantially before full saturation?
It depends on how the work times of these unsolvable tasks are distributed; you could in principle get any outcome. But there are a few ways to check for the existence of unsolvable tasks, maybe I’ll find the time today.
Hmm, actually none of these checks can distinguish between genuinely unsolvable tasks and tasks that are merely unsolvable for further scaled-up models of the current kind (with the scaffolding and compute used in the evaluations).
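To make one such check concrete: a minimal sketch, on entirely synthetic data, of fitting success rate vs. task length with and without an "unsolvable fraction" asymptote and comparing the fits. A better fit with asymptote c < 1 would be *consistent with* unsolvable tasks, but, per the caveat above, it cannot distinguish truly impossible tasks from tasks merely beyond current models. All task lengths, trial counts, and parameter values here are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Hypothetical task lengths (minutes) and per-task trial counts.
lengths = np.geomspace(1, 480, 40)
trials = np.full(lengths.shape, 50, dtype=int)

def logistic(t, t50, k):
    # Plain logistic: success probability decays with log task length.
    return 1.0 / (1.0 + np.exp(k * (np.log(t) - np.log(t50))))

def capped_logistic(t, t50, k, c):
    # Same shape, but only a fraction c of tasks is solvable at all.
    return c / (1.0 + np.exp(k * (np.log(t) - np.log(t50))))

# Simulate a world where 10% of tasks are unsolvable (c = 0.9),
# i.e. the effect we are trying to detect.
p_true = capped_logistic(lengths, t50=60, k=1.5, c=0.9)
p_hat = rng.binomial(trials, p_true) / trials

popt1, _ = curve_fit(logistic, lengths, p_hat, p0=[60, 1.0])
popt2, _ = curve_fit(capped_logistic, lengths, p_hat, p0=[60, 1.0, 0.95],
                     bounds=([1e-3, 1e-3, 0.0], [1e4, 10.0, 1.0]))

def sse(model, popt):
    # Sum of squared residuals as a crude goodness-of-fit measure.
    return float(np.sum((p_hat - model(lengths, *popt)) ** 2))

print(f"plain logistic  SSE: {sse(logistic, popt1):.4f}")
print(f"capped logistic SSE: {sse(capped_logistic, popt2):.4f}, "
      f"fitted solvable fraction c = {popt2[2]:.2f}")
```

On data generated this way the capped fit recovers c near 0.9, but the same signature would also appear if 10% of tasks were simply too hard for the model family being evaluated, which is exactly the ambiguity noted above.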