2021 forecasters vs 2025 reality:
“Current performance on [the MATH] dataset is quite low--6.9%--and I expected this task to be quite hard for ML models in the near future. However, forecasters predict more than 50% accuracy* by 2025! This was a big update for me. (*More specifically, their median estimate is 52%; the confidence range is ~40% to 60%, but this is potentially artificially narrow due to some restrictions on how forecasts could be input into the platform.)” (source)
Reality: 97.3%+ [1] (on MATH-500, a narrowed subset restricted to only the hardest, difficulty-5 questions, created because the original benchmark had become saturated)
Reliable forecasting requires either a generative model of the underlying dynamics, or a representative reference class. The singularity has no good reference class, so people trying to use reference classes rather than gears modelling will predictably be spectacularly wrong.
I mentioned this before, but that interface didn’t allow for wide spreads, so what you might be looking at is a plot of people’s medians, not the whole distribution. In general, Hypermind’s user interface was so, so shitty, and they paid so poorly (if at all) that I don’t think it’s fair to describe that screenshot as “superforecasters”.
Thanks, I’ve updated it to “forecasters”; does that seem fair?
Also, I know this is super hard, but do you have a sense of what superforecasters might have guessed back then?
While the singularity doesn’t have a reference class, benchmarks do have a reference class—we have enough of them that we can fit reasonable distributions on when benchmarks will reach 50%, be saturated, etc., especially if we know the domain. The harder part is measuring superintelligence with benchmarks.
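As a minimal sketch of what I mean (all the durations below are invented placeholders, not real benchmark data): collect how long past benchmarks in the domain took to go from release to 50% accuracy, fit a lognormal to those durations, and read off a median and interval for a new benchmark.

```python
import numpy as np

# Hypothetical reference class: years from benchmark release until a model
# first scored >=50% on it. These numbers are placeholders, not real data.
years_to_50pct = np.array([0.8, 1.5, 2.0, 2.5, 3.5, 4.0])

# Fit a lognormal by taking the mean/std of the log-durations.
log_durations = np.log(years_to_50pct)
mu, sigma = log_durations.mean(), log_durations.std(ddof=1)

# Median and a rough 80% interval for "time until a new benchmark hits 50%".
median = np.exp(mu)
low, high = np.exp(mu + sigma * np.array([-1.2816, 1.2816]))  # 10th/90th percentile z-scores
print(f"median ~{median:.1f} yrs, 80% interval ~[{low:.1f}, {high:.1f}] yrs")
```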
Yup, agree this holds for a while, and agree that benchmarking superintelligence is tricky. However, I think there’s reason to expect the dynamics driving the benchmarks to change in ways which don’t have clear precedent at some point, once new parts of the feedback loops seriously take off.
I operationalize this as: I expect more than 2 upwards trend-breaking events (of at least the magnitude of the move from 7-month to 4-month doubling times), and at least twice as many upwards trend-breaking events as downwards ones.
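To make that counting rule concrete, here is a rough sketch; the doubling-time series is invented for illustration, and treating any change at least as large as the 7-to-4-month move as a “trend break” is my own assumption about how to operationalize it.

```python
import numpy as np

# Hypothetical sequence of fitted doubling times (in months), re-estimated over time.
# These values are made up purely to illustrate the counting rule.
doubling_times = np.array([7.0, 6.8, 3.8, 3.9, 2.1, 2.3, 1.1, 2.0])

# Treat a change at least as large as the 7 -> 4 month move as a trend break.
break_ratio = 7.0 / 4.0

up_breaks = down_breaks = 0
for prev, curr in zip(doubling_times[:-1], doubling_times[1:]):
    if prev / curr >= break_ratio:    # doubling time shrank sharply: upwards break
        up_breaks += 1
    elif curr / prev >= break_ratio:  # doubling time grew sharply: downwards break
        down_breaks += 1

# The operationalized claim: >2 upwards breaks, and at least twice as many up as down.
claim_holds = up_breaks > 2 and up_breaks >= 2 * down_breaks
print(up_breaks, down_breaks, claim_holds)
```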
I expect that at some point, possibly after loss of control, a large number of things unblock in quick succession and you get a supercritical chain reaction of capability improvements which blows naive extrapolation out of the water.
You might be interested in my retrospective here / here (direct link to raw predictions/rationales here)