Unfortunately, the available benchmark tasks do not allow for 99%+ reliability measurements. Because we don't have 1,000 different one-minute tasks, the best we could do would be something like checking whether GPT-5.1 can complete all 40 tasks 25 times each with perfect reliability. Most likely it would succeed at all of them, simply because we don't have a task that happens to trip it up.
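A rough sketch of why the task count is the bottleneck, under the assumption that failures are correlated within a task (a model that fails a task once tends to fail it again), so the 40 distinct tasks are the effective sample size rather than the 40 × 25 = 1,000 individual runs. It uses the "rule of three": after zero failures in n independent trials, an approximate 95% upper bound on the failure rate is 3/n.

```python
def rule_of_three_upper_bound(n_trials: int) -> float:
    """Approximate 95% upper bound on the failure rate
    after observing zero failures in n_trials independent trials."""
    return 3 / n_trials

# Treating every (task, repetition) pair as independent: 40 * 25 = 1,000 trials.
print(rule_of_three_upper_bound(40 * 25))  # 0.003 -> would support a ~99.7% reliability claim

# But if failures are task-specific, only the 40 distinct tasks are informative:
print(rule_of_three_upper_bound(40))       # 0.075 -> supports only a ~92.5% claim
```

So even a perfect 1,000-for-1,000 run can't justify a 99%+ figure across tasks; it can only bound reliability at roughly 92-93% unless many more distinct tasks are added.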
As for humans' 99.9%: at a granular enough level, the unit would be about 0.2 seconds (typing one keystroke), because few people type with better than 99.9% keystroke accuracy. But in the context of a larger task we can correct our typos, so that figure isn't very relevant.
Is 80% the highest success rate you can practically test?
UPD: Thomas essentially answered elsewhere: