I very roughly polled METR staff (using Fatebook) on what the 50% time horizon will be by EOY 2026, conditional on METR reporting something analogous to today’s time horizon metric.
I got the following results: 29% average probability that it will surpass 32 hours. 68% average probability that it will surpass 16 hours.
The first question got 10 respondents and the second question got 12. Around half of the respondents were technical researchers. I expect the sample to be close to representative, but maybe a bit more short-timelines than the rest of METR staff.
The average probability that the question doesn’t resolve AMBIGUOUS is somewhere around 60%.
Just for context, the reason we might not report something like today’s time horizon metric is that we don’t have enough tasks beyond 8 hours. We’re actively working on several ways to extend this, but there’s always a chance none of them will work out and we won’t have enough confidence to report a number by the end of 2026.
This very roughly implies that the median of “50% time horizon as predicted by METR staff” by EOY 2026 is a bit higher than 20 hours.
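For anyone who wants to check the arithmetic, here is a minimal sketch of one way to back a median out of the two poll numbers. It assumes the two averages (68% above 16 hours, 29% above 32 hours) can be treated as two points on a single survival curve, and interpolates in log2(hours) to find where the probability of surpassing crosses 50%. That is an assumption on my part, not necessarily how the estimate above was made: the two questions had different respondents (12 and 10), and averaging probabilities is not the same as taking a median forecast. The function name and the `probit` option are made up for illustration.

```python
import math
from statistics import NormalDist


def implied_median_hours(h_lo, p_lo, h_hi, p_hi, probit=False):
    """Estimate the horizon (in hours) at which P(surpass) = 0.5.

    h_lo, h_hi: the two thresholds in hours (h_lo < h_hi).
    p_lo, p_hi: average probability of surpassing each threshold.
    probit=False interpolates the probability linearly in log2(hours);
    probit=True interpolates the normal quantile instead, which amounts
    to assuming a log-normal forecast distribution.
    """
    x_lo, x_hi = math.log2(h_lo), math.log2(h_hi)
    if probit:
        inv = NormalDist().inv_cdf
        y_lo, y_hi, y_mid = inv(p_lo), inv(p_hi), 0.0
    else:
        y_lo, y_hi, y_mid = p_lo, p_hi, 0.5
    # Fraction of the way from the low threshold to the high one
    # at which the curve crosses the 50% mark.
    frac = (y_lo - y_mid) / (y_lo - y_hi)
    return 2 ** (x_lo + frac * (x_hi - x_lo))


linear_median = implied_median_hours(16, 0.68, 32, 0.29)
probit_median = implied_median_hours(16, 0.68, 32, 0.29, probit=True)
print(f"linear in log2(hours): {linear_median:.1f} h")
print(f"log-normal assumption: {probit_median:.1f} h")
```

Both variants land at roughly 22 hours, which fits "a bit higher than 20 hours".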
This would be way above-trend??
Nikola’s 20hr median, let alone the 29% probability of a 32hr horizon or higher, does require more than two doublings of GPT-5.1-Codex-Max’s result of 2h42m: reaching 20hr takes close to three doublings (about 2.9), and reaching 32hr takes about three and a half. The most recent trend, observed between o3 and GPT-5.1-Codex-Max, is a doubling every 7 months. But the earlier trend, between Claude 3.5 Sonnet and o3, was a doubling every 4 months.
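A quick sketch of the doubling arithmetic, in case it helps. It takes the 2h42m result as the starting point and asks how many doublings, and how many months at each of the two trends, it would take to reach 20 and 32 hours. The helper name and constants are just illustrative, and the month counts are measured from whenever the 2h42m result applies, so this says nothing by itself about a specific calendar date.

```python
import math

START_HOURS = 2 + 42 / 60  # GPT-5.1-Codex-Max's 50% time horizon, 2h42m


def doublings_needed(start_hours, target_hours):
    """Number of doublings to get from start_hours to target_hours."""
    return math.log2(target_hours / start_hours)


# Months required to reach each target under each doubling time.
results = {}
for target_hours in (20, 32):
    n = doublings_needed(START_HOURS, target_hours)
    for months_per_doubling in (7, 4):
        results[(target_hours, months_per_doubling)] = n * months_per_doubling
    print(
        f"{target_hours}h: {n:.2f} doublings, "
        f"{results[(target_hours, 7)]:.1f} months at 7 months/doubling, "
        f"{results[(target_hours, 4)]:.1f} months at 4 months/doubling"
    )
```

So 20 hours is about 2.9 doublings away: roughly 20 months on the 7-month trend, or a bit under 12 months on the 4-month trend. For 32 hours it is about 3.6 doublings, so roughly 25 months or about 14 months respectively.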
I suspect that METR will soon publish information about Gemini 3 Pro, Claude Opus 4.5 and GPT-5.2, and that this will let us learn METR’s rationale for expecting the fast doubling trend to return. Or, if METR understands the threat of 20hr+ time horizons, then METR could be trying to add tasks THAT long to their suite (optimizing a BIG library of bad code? Developing complex apps?)