There’s a high bar to clear here: LLM capabilities have so far progressed at a hyper-exponential rate with no signs of a slowdown [1].
7-month doubling time (early models)
5.7-month doubling time (post-GPT-3.5)
4.2-month doubling time (post-o1)
So, an argument for the claim that we’re about to plateau has to be more convincing than induction from this strong pattern we’ve observed since at least the release of GPT-2 in February 2019.
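To make the trend concrete, here is a minimal back-of-the-envelope sketch in Python (my own arithmetic, not METR’s code) of what these doubling times imply for year-over-year growth in the 50% time horizon:

```python
def multiplier_per_year(doubling_months: float) -> float:
    """Growth factor of the 50% time horizon over 12 months, assuming
    exponential growth h(t) = h0 * 2 ** (t / doubling_months)."""
    return 2 ** (12 / doubling_months)

# The doubling times listed above imply roughly these year-over-year
# multipliers on task length:
for d in (7.0, 5.7, 4.2):
    print(f"{d}-month doubling -> ~{multiplier_per_year(d):.1f}x per year")
```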
Your argument does not clear this high bar. You have made the same kind of argument that has been made, and proven wrong, again and again over the past seven years of scaling up GPTs.
One can’t simply point out that the things LLMs currently cannot do are hard in a way that the things they currently can do are not. Of course, the things they cannot do are different from the things they can do. The same has been true of every capability gain we have observed so far, so the difference by itself is not evidence that the observed pattern is unlikely to continue.
So, you would need to go further. You would need to demonstrate that they’re different in a way that meaningfully departs from how past, successfully gained capabilities differed from earlier ones.
To make this more concrete, claims based on supposed architectural limitations are not an exception to this rule: many such claims have been made in the past and proven incorrect. The base rate here is unfavourable to the pessimist.
Even solid proofs of fundamental limitations are not sufficient on their own: they tend to be proofs that LLMs cannot do X by means Y, rather than proofs that LLMs cannot do X at all.
To be convincing, you have to make an argument that fundamentally differentiates your objection from past failed objections.
[1] Based on METR’s research: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
I explained in my post that I believe the benchmarks are mainly measuring shallow thinking. The benchmarks include things like completing a single word of code or solving arithmetic problems. These unambiguously fall within what I described as shallow thinking. They measure existing judgement/knowledge, not the ability to form new insights.
Deep thinking has not progressed hyper-exponentially. LLMs are essentially shrimp-level when it comes to deep thinking, in my opinion. LLMs still make extremely basic mistakes that a human 5-year-old would never make. This is undeniable if you actually use them for solving problems.
One can’t simply point out that the things LLMs currently cannot do are hard in a way that the things they currently can do are not.
The distinction between deep and shallow thinking is real and fundamental. Deep thinking is non-polynomial in its time complexity. I’m not moving the goalposts to include only whatever LLMs happen to be bad at right now. They have always been bad at deep thinking, and they continue to be. All the gains measured by the benchmarks are gains in shallow thinking.
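To gesture at what I mean by “non-polynomial”, here is a toy sketch in Python. It is only an analogy, not a proof, and the brute-force subset search is just a stand-in I chose for searching a space of candidate insights:

```python
from itertools import combinations

def verify(candidate, target):
    """'Shallow' step: checking a proposed combination is cheap (polynomial)."""
    return sum(candidate) == target

def search(numbers, target):
    """'Deep' step: finding a combination with no guiding insight means
    scanning up to 2**len(numbers) candidates (exponential)."""
    for r in range(1, len(numbers) + 1):
        for cand in combinations(numbers, r):
            if verify(cand, target):
                return cand
    return None

# Verification scales with the size of one candidate; blind search scales
# with the size of the whole candidate space.
print(search([3, 7, 12, 5, 18, 31, 2], 40))  # -> (7, 31, 2)
```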
To be convincing, you have to make an argument that fundamentally differentiates your objection from past failed objections.
I believe I have done so: by claiming that deep thinking is of a fundamentally different nature than shallow thinking, and by denying that any significant progress has been made on this front.
If you disagree, fine. Like I said, I can’t prove anything; I’m just putting forward a hypothesis. But you don’t get to say I’ve been proven wrong. If you want to come up with some way of measuring deep thinking and show that LLMs are or are not good at it, go ahead. Until that work has been done, I haven’t been proven wrong, and we can’t say either way.
(Certain things are easy to measure and benchmark, and these things also tend to require only shallow thinking. Things that require deep thinking are hard to measure for the same reason they require deep thinking, and so they don’t make it into benchmarks. The only way I know of to measure deep thinking is personal judgement, which obviously isn’t convincing. But the fact that this work is hard to do doesn’t mean we simply conclude that I’m wrong and you’re right.)
Except that Grok 4 and GPT-5 arguably already didn’t adhere to the faster doubling time. And I say “arguably” because Grok failed some primitive tasks and because of Greenblatt’s pre-release prediction of GPT-5’s time horizon. While METR technically didn’t confirm the prediction, METR itself acknowledged that it ran into problems when trying to calculate GPT-5’s time horizon.
Another thing to consider is that Grok 4’s SOTA performance was achieved by using similar amounts of compute for pretraining and RL. What is Musk going to do to ensure that Grok 5 is AGI? Use some advanced architecture like neuralese?
EDIT: you mention a 5.7-month doubling time post-GPT-3.5. But there actually was a plateau or slowdown between GPT-4 and GPT-4o, which was followed by the accelerated GPT-4o-to-o3 trend.
I don’t think there was a plateau. Is there a reason you’re ignoring Claude models?
Greenblatt’s predictions don’t seem pertinent.
Look at the METR graph more carefully. The Claude models METR evaluated were released during the period I called the accelerated GPT-4o-to-o3 trend (except for Claude 3 Opus, but it wasn’t SOTA even relative to the GPT-4-to-GPT-4o trend).
With pre-RLVR models we went from a 36-second 50% time horizon to a 29-minute horizon.
Between GPT-4 and Claude 3.5 Sonnet (new) we went from 5 minutes to 29 minutes.
I’ve looked carefully at the graph, but I see no sign of a plateau, or even of a slowdown.
I’ll do some calculations to make sure I’m not missing anything obvious or deceiving myself.
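For instance, here is a minimal sketch of the kind of check I have in mind, using the 50% time-horizon figures quoted above and approximate release dates that are my own assumption rather than values taken from METR’s data:

```python
from datetime import date
from math import log2

def implied_doubling_months(h_start, h_end, start: date, end: date) -> float:
    """Doubling time (in months) implied by exponential growth in the
    50% time horizon between two model releases."""
    months = (end - start).days / 30.44  # average month length
    return months / log2(h_end / h_start)

# ~5-minute horizon around GPT-4's release, ~29 minutes around
# Claude 3.5 Sonnet (new); dates are approximate assumptions.
d = implied_doubling_months(5.0, 29.0, date(2023, 3, 14), date(2024, 10, 22))
print(f"Implied doubling time: ~{d:.1f} months")  # compare with the ~7-month pre-reasoning trend
```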
I don’t see any sign of a plateau here. Things were a little behind trend right after GPT-4, but of course there will be short behind-trend periods, just as there will be short above-trend periods, even assuming the trend is projectable.
I’m not sure why you are starting from GPT-4 and ending at GPT-4o. Starting with GPT-3.5 and ending with Claude 3.5 Sonnet (new) seems more reasonable, since these were all post-RLHF, non-reasoning models.
AFAIK the Claude 3.5 models were not trained on data from reasoning models?