Nikola’s comment about the 20hr median, let alone the 29% probability of a 32hr horizon or higher, requires nearly three doublings of GPT-5.1-Codex-Max’s result of 2h42m (and, in the 32hr case, more than three and a half). The most recent trend of a doubling every 7 months is the one observed between o3 and GPT-5.1-Codex-Max. But there was the earlier, faster trend between Claude 3.5 Sonnet and o3, where a doubling happened every 4 months.
I suspect that METR will soon publish information about Gemini 3 Pro, Claude Opus 4.5 and GPT-5.2, and that data should tell us whether the fast doubling trend has returned. Or, if METR understands the threat of 20hr+ time horizons, then METR could be trying to add tasks THAT long to their suite (optimizing a BIG library of bad code? Developing complex apps?)
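A quick back-of-the-envelope check of the doubling counts and timelines, assuming pure exponential growth at the two trend rates mentioned above (7 months per doubling for o3 → GPT-5.1-Codex-Max, 4 months for Claude 3.5 Sonnet → o3):

```python
import math

current = 2 + 42 / 60  # GPT-5.1-Codex-Max's time horizon: 2h42m = 2.7 hours

for target in (20, 32):
    # number of doublings needed to go from 2.7h to the target horizon
    doublings = math.log2(target / current)
    slow = doublings * 7  # months at the o3 -> GPT-5.1-Codex-Max rate
    fast = doublings * 4  # months at the Claude 3.5 Sonnet -> o3 rate
    print(f"{target}h horizon: {doublings:.2f} doublings "
          f"(~{slow:.0f} months on the slow trend, ~{fast:.0f} on the fast one)")
```

This yields about 2.9 doublings for the 20hr horizon and about 3.6 for 32hr, i.e. roughly 20 vs. 12 months and 25 vs. 14 months on the slow and fast trends respectively.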