Look at the METR graph more carefully. The Claude models METR evaluated were released during the period I called the GPT4o-o3 accelerated trend (except for Claude 3 Opus, which wasn't SOTA even relative to the GPT4-GPT4o trend).
With pre-RLVR models we went from a 36-second 50% time horizon to a 29-minute horizon.
Between GPT-4 and Claude 3.5 Sonnet (new) we went from 5 minutes to 29 minutes.
I’ve looked carefully at the graph, but I see no sign of a plateau, or even of a slowdown.
I’ll do some calculation to ensure I’m not missing anything obvious or deceiving myself.
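Here is the kind of back-of-envelope calculation I mean, sketched in Python. The release dates are my assumptions (approximate public release dates for GPT-4 and Claude 3.5 Sonnet (new)); the 5-minute and 29-minute 50% time horizons are the figures quoted above.

```python
from datetime import date
from math import log2

# Assumed release dates (approximate; not from the METR paper itself).
gpt4_release = date(2023, 3, 14)         # GPT-4
sonnet_new_release = date(2024, 10, 22)  # Claude 3.5 Sonnet (new)

# 50% time horizons in minutes, as quoted above.
gpt4_horizon_min = 5.0
sonnet_horizon_min = 29.0

days = (sonnet_new_release - gpt4_release).days
doublings = log2(sonnet_horizon_min / gpt4_horizon_min)
doubling_time_days = days / doublings

print(f"{doublings:.2f} doublings over {days} days")
print(f"doubling time ≈ {doubling_time_days:.0f} days "
      f"(≈ {doubling_time_days / 30.4:.1f} months)")
```

Under these assumptions this works out to roughly 2.5 doublings over about 19 months, i.e. a doubling time on the order of 7–8 months, which is consistent with the pre-2024 trend rather than a plateau.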
I don’t see any sign of a plateau here. Things were a little behind-trend right after GPT-4, but of course there will be short behind-trend periods just as there will be short above-trend periods, even assuming the trend is projectable.
I’m not sure why you are starting from GPT-4 and ending at GPT-4o. Starting with GPT-3.5 and ending with Claude 3.5 Sonnet (new) seems more reasonable, since these were all post-RLHF, non-reasoning models.
AFAIK the Claude 3.5 models were not trained on data from reasoning models?
I don’t think there was a plateau. Is there a reason you’re ignoring Claude models?
Greenblatt’s predictions don’t seem pertinent.