Nikola Jurkovic comments on nikola’s Shortform

Nikola Jurkovic 14 Dec 2025 18:56 UTC
27 points
2
I very roughly polled METR staff (using Fatebook) on what the 50% time horizon will be by EOY 2026, conditional on METR reporting something analogous to today’s time horizon metric.
I got the following results: 29% average probability that it will surpass 32 hours. 68% average probability that it will surpass 16 hours.
The first question got 10 respondents and the second question got 12. Around half of the respondents were technical researchers. I expect the sample to be close to representative, but maybe a bit more short-timelines than the rest of METR staff.
The average probability that the question doesn’t resolve AMBIGUOUS is somewhere around 60%.
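A rough way to turn the two poll numbers into a single median is to interpolate linearly in log2(hours) between the 16-hour and 32-hour thresholds. The sketch below does that. The function name `interpolate_median_hours` is hypothetical, and the method assumes the averaged probabilities behave like one smooth distribution, which the poll (with 10 and 12 respondents on the two questions) does not guarantee.

```python
import math


def interpolate_median_hours(p_above_low, low_hours, p_above_high, high_hours):
    """Estimate the threshold T where P(horizon > T) = 0.5.

    Assumption (not from the original post): the survival probability
    falls linearly in log2(hours) between the two polled thresholds.
    """
    # Fraction of the way from the low threshold to the high one, in log2 space.
    frac = (p_above_low - 0.5) / (p_above_low - p_above_high)
    log_low = math.log2(low_hours)
    log_high = math.log2(high_hours)
    return 2 ** (log_low + frac * (log_high - log_low))


# Poll averages from the post: 68% above 16 hours, 29% above 32 hours.
median_hours = interpolate_median_hours(0.68, 16, 0.29, 32)
print(f"Interpolated median: {median_hours:.1f} hours")
```

Under these assumptions the interpolated median comes out at roughly 22 hours.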
- Thomas Kwa 15 Dec 2025 19:24 UTC
  16 points
  0
  Just for context, the reason we might not report something like today’s time horizon metric is that we don’t have enough tasks beyond 8 hours. We’re actively working on several ways to extend this, but there’s always a chance none of them will work out and we won’t have enough confidence to report a number by the end of 2026.
- Nikola Jurkovic 14 Dec 2025 19:00 UTC
  9 points
  0
  This very roughly implies that the median of “50% time horizon as predicted by METR staff” by EOY 2026 is a bit higher than 20 hours.
- Cole Wyeth 15 Dec 2025 5:07 UTC
  2 points
  1
  This would be way above-trend??
  - StanislavKrym 15 Dec 2025 10:15 UTC
    2 points
    1
    Nikola’s comment implies more than two doublings beyond GPT-5.1-Codex-Max’s result of 2h42m: the 20hr median needs nearly three doublings (about 2.9), and the 32hr threshold, which got a 29% probability, needs about 3.6. The most recent trend, observed between o3 and GPT-5.1-Codex-Max, is one doubling per 7 months. But the earlier trend from Claude 3.5 Sonnet to o3 had a doubling every 4 months.
    I suspect that METR will soon publish information about Gemini 3 Pro, Claude Opus 4.5 and GPT-5.2, which will let us learn METR’s rationale behind the return of the fast doubling trend. Or, if METR understands the threat of 20hr+ time horizons, then METR could be trying to add tasks THAT long to their suite (optimizing a BIG library of bad code? Developing complex apps?)
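The doubling arithmetic in this thread can be checked with a short sketch. It assumes pure exponential growth from the 2h42m figure quoted above at a fixed doubling time. The function names are hypothetical, and the 4-month and 7-month doubling times are the ones cited in the thread.

```python
import math

# Starting point quoted in the thread: 2 hours 42 minutes.
START_HOURS = 2 + 42 / 60


def doublings_needed(start_hours, target_hours):
    """Number of doublings to grow from start_hours to target_hours."""
    return math.log2(target_hours / start_hours)


def months_needed(start_hours, target_hours, months_per_doubling):
    """Months required at a constant doubling time (an assumption)."""
    return doublings_needed(start_hours, target_hours) * months_per_doubling


for target in (20, 32):
    n = doublings_needed(START_HOURS, target)
    slow = months_needed(START_HOURS, target, 7)
    fast = months_needed(START_HOURS, target, 4)
    print(f"{target}h: {n:.2f} doublings, "
          f"{fast:.1f} months at 4 mo/doubling, "
          f"{slow:.1f} months at 7 mo/doubling")
```

This gives roughly 2.9 doublings to reach 20 hours and roughly 3.6 to reach 32 hours, or about 12 and 14 months at the 4-month pace versus about 20 and 25 months at the 7-month pace.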