Yes, but 2025 saw two trends with different doubling times: one from Claude 3.5 Sonnet to o3, and another from o3 to GPT-5.1-Codex-Max. IIRC the earlier trend would have superhuman coders appear by 2028, while the later trend (which was arguably invalidated by Claude 4.5 Opus and its ~5h time horizon; see, however, two comments pointing out that the METR benchmark is no longer as trustworthy as it once was, and my potential explanation of the abnormally high 50%/80% time horizon ratio) would have them arrive in 2030 or outright hit a wall[1] before becoming superhuman.
As for the OP’s idea that coding agents are used to improve coding agents until the SC is reached, this could be unlikely because they don’t improve the underlying LLM. I remember the now-obsolete benchmarks-and-gaps model, which required the SCs not just to saturate RE-bench, but to actually learn to do long tasks and handle complex codebases, which in turn requires either a big attention span of the LLM itself or careful summarisation of each method’s specification, formatting, other methods’ names, etc.
P.S. The latter scenario would be particularly difficult to predict, as it might involve the time horizon in the METR sense behaving like $\frac{e^{ct}}{e^{ct_\infty}-e^{ct}}$. In this case the horizon would grow ~exponentially until the very last couple of doublings.
[1] Or become neuralese, with consequences as disastrous as the lack of Safer-1 to test alignment.
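A minimal numerical sketch of the functional form in the P.S. (all constants made up for illustration, not fitted to METR data or anything else): for most of the range the horizon doubles at roughly its baseline rate, and the doubling time only collapses in the last couple of doublings before $t_\infty$.

```python
import numpy as np

# Purely illustrative constants: a baseline doubling time of ~0.4 years and a
# hypothetical "wall"/singularity at t_inf = 5 years (neither is a real forecast).
c = np.log(2) / 0.4
t_inf = 5.0

def horizon(t):
    # The form from the P.S.: e^{ct} / (e^{c*t_inf} - e^{ct})
    return np.exp(c * t) / (np.exp(c * t_inf) - np.exp(c * t))

grid = np.linspace(0, 4.95, 400)
for t0 in [0.0, 2.0, 4.0, 4.8, 4.9]:
    h0 = horizon(t0)
    # first grid point at which the horizon has doubled relative to t0
    later = grid[(grid > t0) & (horizon(grid) >= 2 * h0)]
    dt = later[0] - t0 if len(later) else float("nan")
    print(f"t = {t0:>3} yr: horizon = {h0:9.4f}, time to next doubling ≈ {dt:.2f} yr")
```

For small $t$ the denominator is essentially constant, so the growth is a plain exponential with the baseline doubling time; only as $t$ approaches $t_\infty$ does the denominator shrink and the remaining doublings compress.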
When looking for trend breaks in time series, it’s unwise to rely on eyeballing when the Quandt likelihood ratio test (aka sup-Wald test) has existed for 65 years (google it or ask an LLM to explain it in layman’s terms).
I pulled the METR data and asked Gemini 3 Flash to vibecode the test, and there is a statistically significant break (peak F-statistic = 7.79, corresponding to a p-value of about 0.03) at Claude 3.5 Sonnet, with the doubling time dropping from ~8 months to ~5 months, but no significant break after it.
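For reference, a sketch of the kind of computation described above, assuming the METR data sits in a CSV with one row per model, a release date, and a 50% time horizon in minutes (the file name and column names are placeholders, and this is the textbook Chow/sup-Wald recipe rather than the exact script Gemini produced):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical schema: release date plus 50% time horizon per model (column names assumed).
df = pd.read_csv("metr_horizons.csv", parse_dates=["release_date"]).sort_values("release_date")
t = ((df["release_date"] - df["release_date"].min()).dt.days / 365.25).to_numpy()
y = np.log(df["horizon_minutes_p50"].to_numpy())

def chow_f(y, t, k):
    """Chow F-statistic for a break after observation k (intercept and slope both shift)."""
    ssr_restricted = sm.OLS(y, sm.add_constant(t)).fit().ssr
    ssr_unrestricted = (sm.OLS(y[:k], sm.add_constant(t[:k])).fit().ssr
                        + sm.OLS(y[k:], sm.add_constant(t[k:])).fit().ssr)
    q, n, p = 2, len(y), 4  # 2 restrictions; 4 parameters in the unrestricted model
    return ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - p))

# QLR / sup-Wald: maximise the Chow F over candidate breaks in the middle 70% of the sample.
trim = int(0.15 * len(y))
candidates = list(range(trim, len(y) - trim))
f_stats = [chow_f(y, t, k) for k in candidates]
k_star = candidates[int(np.argmax(f_stats))]
print(f"peak F = {max(f_stats):.2f} at {df['release_date'].iloc[k_star].date()}")
# Significance must be read off Andrews (1993) sup-Wald critical values, not the plain
# F distribution, because the break date is itself chosen to maximise the statistic.
```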