Except that Grok 4 and GPT-5 arguably already didn't adhere to the faster doubling time. I say "arguably" because Grok failed some primitive tasks, and because of Greenblatt's pre-release prediction of GPT-5's time horizon. While METR technically didn't confirm that prediction, METR itself acknowledged that it ran into problems when trying to calculate GPT-5's time horizon.
Another thing to consider is that Grok 4's SOTA performance was achieved by using similar amounts of compute for pretraining and for RL. What is Musk going to do to ensure that Grok 5 is AGI? Use some advanced architecture like neuralese?
EDIT: you mention a 5.7-month doubling time post-GPT-3.5. But there actually was a plateau or slowdown between GPT-4 and GPT-4o, which was followed by the accelerated GPT-4o-o3 trend.
I don’t think there was a plateau. Is there a reason you’re ignoring Claude models?
Greenblatt’s predictions don’t seem pertinent.
Look at the METR graph more carefully. The Claudes which METR evaluated were released during the period I called the GPT-4o-o3 accelerated trend (except for Claude 3 Opus, but it wasn't SOTA even in comparison with the GPT-4-GPT-4o trend).
With pre-RLVR models we went from a 36-second 50% time horizon to a 29-minute horizon.
Between GPT-4 and Claude 3.5 Sonnet (new) we went from 5 minutes to 29 minutes.
I've looked carefully at the graph, but I see no sign of a plateau, or even of a slowdown.
I'll do some calculations to ensure I'm not missing anything obvious or deceiving myself.
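A quick back-of-the-envelope version of that calculation, using the 50% time horizons quoted in this thread (36 s, 5 min, 29 min) and approximate public release dates — the dates are my own assumption, not something stated in the thread:

```python
from datetime import date
from math import log2

# 50% time-horizon figures quoted in the thread, in seconds.
horizons = {
    "GPT-3.5": 36,                        # 36 s
    "GPT-4": 5 * 60,                      # 5 min
    "Claude 3.5 Sonnet (new)": 29 * 60,   # 29 min
}

# Approximate public release dates (my assumption).
releases = {
    "GPT-3.5": date(2022, 11, 30),
    "GPT-4": date(2023, 3, 14),
    "Claude 3.5 Sonnet (new)": date(2024, 10, 22),
}

def doubling_time_months(a: str, b: str) -> float:
    """Implied months per doubling of the time horizon between models a and b."""
    months = (releases[b] - releases[a]).days / 30.44
    doublings = log2(horizons[b] / horizons[a])
    return months / doublings

print(f"{doubling_time_months('GPT-3.5', 'Claude 3.5 Sonnet (new)'):.1f}")  # ~4.1 months/doubling
print(f"{doubling_time_months('GPT-4', 'Claude 3.5 Sonnet (new)'):.1f}")    # ~7.6 months/doubling
```

These are crude two-endpoint estimates and are sensitive to which models and dates you pick; METR's own doubling-time figure comes from a regression over many models, not from any single pair of endpoints.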
I don't see any sign of a plateau here. Things were a little behind-trend right after GPT-4, but of course there will be short behind-trend periods just as there will be short above-trend periods, even assuming the trend is projectable.
I'm not sure why you're starting from GPT-4 and ending at GPT-4o. Starting with GPT-3.5 and ending with Claude 3.5 Sonnet (new) seems more reasonable, since these were all post-RLHF, non-reasoning models.
AFAIK the Claude-3.5 models were not trained based on data from reasoning models?