Here’s an argument for a capabilities plateau at the level of GPT-4 that I haven’t seen discussed before. I’m interested in any holes anyone can spot in it.
Consider the following chain of logic:
The pretraining scaling laws only say that, even for a fixed training method, increasing the model’s size and the amount of data you train on increases the model’s capabilities – as measured by loss, performance on benchmarks, and the intuitive sense of how generally smart a model is.
Nothing says that increasing a model’s parameter-count and the amount of compute spent on training it is the only way to increase its capabilities. If you have two training methods A and B, it’s possible that the B-trained X-sized model matches the performance of the A-trained 10X-sized model.
Empirical evidence: Sonnet 3.5 (at least the not-new one), Qwen2.5-72B, and Llama-3-70B all have 70-ish billion parameters, i.e., fewer than GPT-3. Yet, their performance is at least on par with that of GPT-4 circa early 2023.
Therefore, it is possible to “jump up” a tier of capabilities, by any reasonable metric, using a fixed model size but improving the training methods.
The latest set of GPT-4-sized models (Opus 3.5, Orion, Gemini 1.5 Pro?) are presumably trained using the current-best methods. That is: they should be expected to be at the effective capability level of a model that is 10X GPT-4’s size yet trained using early-2023 methods. Call that level “GPT-5”.
Therefore, the jump from GPT-4 to GPT-5, holding the training method fixed at early 2023, is the same as the jump from early GPT-4 to the current (non-reasoning) SotA, i.e., to Sonnet 3.5.1.
(Never mind that Sonnet 3.5.1 is likely GPT-3-sized too; it still beats the current-best GPT-4-sized models as well. I guess it straight up punches up two tiers?)
The jump from GPT-3 to GPT-4 is dramatically bigger than the jump from early-2023 SotA to late-2024 SotA. I.e., 4-to-5 is less than 3-to-4.
Consider a model 10X bigger than GPT-4 but trained using the current-best training methods; an effective GPT-6. We should expect the jump to it, from the current SotA, to be at most as significant as the capability jump from early-2023 to late-2024. And since the jumps are shrinking (4-to-5 is less than 3-to-4), it’s likely even less significant than that.
Empirical evidence: Despite the proliferation of all sorts of better training methods, including, e.g., the suite of tricks that allowed DeepSeek to train a near-SotA-level model for pocket change, none of the known non-reasoning models have managed to escape the neighbourhood of GPT-4, and none of the known models (including the reasoning models) have escaped that neighbourhood in domains without easy verification.
Intuitively, if we now know how to reach levels above early-2023!GPT-4 using 20x fewer resources, we should be able to shoot well past early-2023!GPT-4 using the same amount of resources it took – and some of the latest training runs have to have spent 10x the resources that went into the original GPT-4.
E.g., OpenAI’s rumored Orion was presumably trained both with more compute than GPT-4 and via better methods than were employed for the original GPT-4, and it still reportedly underperformed.
Similarly for Opus 3.5: even if it didn’t “fail” as such, the fact that they chose to keep it in-house instead of offering public access for, e.g., $200/month suggests it’s not that much better than Sonnet 3.5.1.
Yet, we have still not left the rough capability neighbourhood of early-2023!GPT-4. (Certainly no jumps similar to the one from GPT-3 to GPT-4.)
Therefore, all known avenues of capability progress aside from the o-series have plateaued. You can make the current SotA more efficient in all kinds of ways, but you can’t advance the frontier.
Are there issues with this logic?
The main potential one is if all models that “punch up a tier” are directly trained on the outputs of the models of the higher tier. In this case, to have a GPT-5-capabilities model of GPT-4’s size, it had to have been trained on the outputs of a GPT-5-sized model, which does not exist yet. “The current-best training methods”, then, do not yet scale to GPT-4-sized models, because they rely on access to a “dumbly trained” GPT-5-sized model. Therefore, although the current-best GPT-3-sized models can be considered at or above the level of early-2023 GPT-4, the current-best GPT-4-sized models cannot be considered to be at the level of a GPT-5 trained using early-2023 methods.
Note, however: this would then imply that all excitement about (currently known) algorithmic improvements is hot air. If the capability frontier cannot be pushed by improving the training methods in any (known) way – if training a GPT-4-sized model on well-structured data, and on reasoning traces from a reasoning model, et cetera, isn’t enough to push it to GPT-5’s level – then pretraining transformers is the only known game in town, as far as general-purpose capability-improvement goes. Synthetic data and other tricks can allow you to reach the frontier in all manner of more efficient ways, but not to move past it.
Basically, it seems to me that one of these must be true:
Capabilities can be advanced by improving training methods (e.g., by using synthetic data).
… in which case we should expect current models to be at the level of GPT-5 or above. And yet they are not much more impressive than GPT-4, which means further scaling will be a disappointment.
Capabilities cannot be advanced by improving training methods.
… in which case scaling pretraining is still the only known method of general capability advancement.
… and if the Orion rumors are true, it seems that even a straightforward scale-up to GPT-4.5ish’s level doesn’t yield much (or: yields less than was expected).
(This still leaves one potential avenue of general capabilities progress: figuring out how to generalize o-series’ trick to domains without easy automatic verification. But if the above reasoning has no major holes, that’s currently the only known promising avenue.)
Given some amount of compute, compute optimal training tries to get the best perplexity out of it on a given dataset, by choosing model size, amount of data, and architecture. An algorithmic improvement in pretraining enables getting the same perplexity by training on data from the same dataset with less compute, achieving better compute efficiency (measured as its compute multiplier).
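For concreteness, here is a rough numerical sketch of what a compute multiplier means. It uses a Chinchilla-style parametric loss with constants roughly as in the Chinchilla paper’s fit, purely for illustration; the efficiency argument is just a stand-in for an algorithmic improvement, not a model of any particular one.

```python
def loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style parametric pretraining loss for N parameters, D tokens."""
    return E + A / N**alpha + B / D**beta

def best_loss(C, efficiency=1.0, grid=200):
    """Best loss reachable with raw compute C (FLOPs), scanning model size N and
    setting D = C_eff / (6 N).  efficiency > 1 stands in for an algorithmic
    improvement that stretches the effective compute budget."""
    C_eff = C * efficiency
    best = float("inf")
    for i in range(grid):
        N = 10 ** (8 + 4 * i / grid)   # scan 1e8 .. ~1e12 parameters
        D = C_eff / (6 * N)            # C ~ 6 N D
        best = min(best, loss(N, D))
    return best

C = 2e25  # roughly original-GPT-4-scale raw compute
print(best_loss(C))                     # baseline method at 2e25 FLOPs
print(best_loss(C / 3, efficiency=3))   # a "3x compute multiplier" method at ~6.7e24 FLOPs
# The two values match: the improved method reaches the same perplexity with 3x less raw compute.
```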
Many models aren’t trained compute optimally; they are instead overtrained (the model is smaller, trained on more data). This looks impressive, since a smaller model is now much better, but it is not an improvement in compute efficiency and doesn’t in any way indicate that it became possible to train a better compute optimal model with a given amount of compute. The data and post-training also recently got better, which creates the illusion of algorithmic progress in pretraining, but their effect is bounded (as long as RL doesn’t take off) and doesn’t keep improving according to pretraining scaling laws once much more data becomes necessary. There is enough data until 2026-2028, but not enough good data.
I don’t think the cumulative compute multiplier since GPT-4 is that high; I’m guessing 3x. The exception is perhaps DeepSeek-V3, which wasn’t trained compute optimally and didn’t use a lot of compute, so it remains unknown what happens if its recipe is used compute optimally with more compute.
The amount of raw compute used since the original GPT-4 has only increased maybe 5x, from 2e25 FLOPs to about 1e26 FLOPs, and it’s unclear whether any compute optimal models were trained on notably more compute than the original GPT-4. We know Llama-3-405B is compute optimal, but it’s not MoE, so it has lower compute efficiency, and it only used 4e25 FLOPs. Claude 3 Opus is probably compute optimal, but it’s unclear whether it used a lot of compute compared to the original GPT-4.
If there was a 6e25 FLOPs compute optimal model with a 3x compute multiplier over GPT-4, it was therefore only trained with 9x the effective compute of the original GPT-4. The 100K-H100 clusters have likely recently trained a new generation of base models at about 3e26 FLOPs, possibly a 45x improvement in effective compute over the original GPT-4, but there’s no word on whether any of them were compute optimal (except perhaps Claude 3.5 Opus), and it’s unclear whether an actual 3x compute multiplier over GPT-4 made it all the way into the pretraining of frontier models. Also, while waiting for NVL72 GB200s (which are much better at inference for larger models), non-Google labs might want to delay deploying compute optimal models in the 1e26-5e26 FLOPs range until later in 2025.
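The bookkeeping in these two cases, spelled out (the flat 3x multiplier is the assumption above; the raw FLOPs figures are the ones quoted):

```python
GPT4_RAW = 2e25      # FLOPs, original GPT-4
MULTIPLIER = 3       # assumed cumulative algorithmic multiplier since original GPT-4

def effective_ratio(raw_flops, multiplier=MULTIPLIER):
    """Effective compute relative to original GPT-4: raw ratio times the multiplier."""
    return (raw_flops / GPT4_RAW) * multiplier

print(effective_ratio(6e25))   # 9.0  -- the hypothetical compute optimal model above
print(effective_ratio(3e26))   # 45.0 -- the 100K-H100-cluster base models
```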
Comparing GPT-3 to GPT-4 gives very little signal on how much of the improvement is from compute, and so how much should be expected beyond GPT-4 from more compute. While modern models make good use of not being compute optimal by using fewer active parameters, GPT-3 was instead undertrained: both larger and less performant than the hypothetical compute optimal alternative. It also wasn’t a MoE model. And most of the bounded low-hanging fruit that is not about pretraining efficiency was never applied to it.
So the currently deployed models don’t demonstrate the results of the experiment of training a much more compute efficient model on much more compute. And the previous leaps in capability are in large part explained by things that are not improvements in compute efficiency or increases in the amount of compute. But in 2026-2027, 1 GW training systems will train models with 250x the compute of the original GPT-4. And probably in 2028-2029, 5 GW training systems will train models with 2500x the raw compute of the original GPT-4. With a compute multiplier of 5x-10x from algorithmic improvements plausible by that time, we get 10,000x-25,000x the original GPT-4 in effective compute. This is enough of a leap that the lack of significant improvement from the mere 9x of currently deployed models (or the 20x-45x of non-deployed newer models, rumored to be underwhelming) is not a strong indication of what happens by 2028-2029 (from scaling of pretraining alone).
Coming back to this in the wake of DeepSeek r1...
How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1, despite using training methods that clearly have wildly different returns on compute investment?
Like, GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million. Yet, they’re roughly at the same level. That should be very surprising. Investing a very different amount of money into V3’s training should’ve resulted in it either massively underperforming GPT-4, or massively overperforming, not landing precisely in its neighbourhood!
Consider this graph. If we find some training method A, and discover that investing $100 million in it lands us at just above “dumb human”, and then find some other method B with a very different ROI, and invest $5.5 million in it, the last thing we should expect is to again land near “dumb human”.
Or consider this trivial toy model: You have two linear functions, f(x) = Ax and g(x) = Bx, where x is the compute invested, output is the intelligence of the model, and f and g are different training methods. You pick some x effectively at random (whatever amount of money you happened to have lying around), plug it into f, and get, say, 120. Then you pick a different random value of x, plug it into g, and get… 120 again. Despite the fact that the multipliers A and B are likely very different, and you used very different x-values as well. How come?
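To make that concrete, here is a tiny simulation of the toy model; all the ranges are made up, and the point is only that two independently chosen (multiplier, budget) pairs almost never land near each other.

```python
import random

# Toy model from above: intelligence = multiplier * compute.
# Draw the two multipliers and the two budgets independently, and count how
# often the two methods end up within 10% of each other.
random.seed(0)
trials, close = 100_000, 0
for _ in range(trials):
    A = random.uniform(0.1, 10)        # returns of method A (made-up range)
    B = random.uniform(0.1, 10)        # returns of method B
    x_a = random.uniform(1, 100)       # compute handed to method A
    x_b = random.uniform(1, 100)       # compute handed to method B
    f, g = A * x_a, B * x_b
    if abs(f - g) / max(f, g) < 0.1:   # "roughly the same level"
        close += 1
print(close / trials)  # only a few percent of trials: same-neighbourhood outcomes are the exception
```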
The explanations that come to mind are:
1. It actually is just that much of a freaky coincidence.
2. DeepSeek have a superintelligent GPT-6 equivalent that they trained for $10 million in their basement, and V3/r1 are just flexes that they specifically engineered to match GPT-4-ish level.
3. DeepSeek directly trained on GPT-4 outputs, effectively just distilling GPT-4 into their model, hence the anchoring.
4. DeepSeek kept investing and tinkering until getting to GPT-4ish level, and then stopped immediately after attaining it.
5. GPT-4ish neighbourhood is where LLM pretraining plateaus, which is why this capability level acts as a sort of “attractor” into which all training runs, no matter how different, fall.
As for how DeepSeek happened to land precisely in the GPT-4/o1 capability region: selection effect. If DeepSeek-V2.5 was this good, we would be talking about it instead.
Original GPT-4 is 2e25 FLOPs and compute optimal, while V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT-4 was trained in 2022 on A100s and needed a lot of them, while in 2024 it could be trained on 8K H100s in BF16. DeepSeek-V3 was trained in FP8, doubling the FLOP/s, so the FLOPs of original GPT-4 could be produced in FP8 by a mere 4K H100s. DeepSeek-V3 was trained on 2K H800s, whose performance is about that of 1.5K H100s. So the cost only has to differ by about 3x, not 20x, when comparing a compute optimal variant of DeepSeek-V3 with original GPT-4, using the same hardware and training with the same floating point precision.
The relevant comparison is with GPT-4o though, not original GPT-4. Since GPT-4o was trained in late 2023 or early 2024, there were 30K H100s clusters around, which makes 8e25 FLOPs of raw compute plausible (assuming it’s in BF16). It might be overtrained, so make that 4e25 FLOPs for a compute optimal model with the same architecture. Thus when comparing architectures alone, GPT-4o probably uses about 15x more compute than DeepSeek-V3.
Returns on compute are logarithmic though, not linear as in the toy model: the advantage of a $150 billion training system over a $150 million one is merely twice that of $150 billion over $5 billion, or of $5 billion over $150 million. Restrictions on access to compute can only be overcome with 30x compute multipliers, and at least DeepSeek-V3 is going to be reproduced using the big compute of US training systems shortly, so that advantage is already gone.
[1] That is, raw utilized compute. I’m assuming the same compute utilization for all models.
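A quick numerical check of the “merely twice” point, under the assumption that returns scale with the log of the spend ratio (the base of the log doesn’t matter for the comparison):

```python
import math

def log_advantage(big, small):
    """Advantage of a `big` training system over a `small` one, in factors of 10 of spend."""
    return math.log10(big / small)

print(log_advantage(150e9, 150e6))  # 3.0
print(log_advantage(150e9, 5e9))    # ~1.48
print(log_advantage(5e9, 150e6))    # ~1.52
# 3.0 is roughly twice ~1.5, as claimed above.
```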
I buy that 1 and 4 are the case, combined with DeepSeek probably being satisfied once GPT-4-level models were achieved.
Edit: I did not mean to imply that GPT-4ish neighbourhood is where LLM pretraining plateaus at all, @Thane Ruthenis.
Thanks!
You’re more fluent in the scaling laws than me: is there an easy way to roughly estimate how much compute would’ve been needed to train a model as capable as GPT-3 if it were done Chinchilla-optimally + with MoEs? That is: what’s the actual effective “scale” of GPT-3?
(Training GPT-3 reportedly took 3e23 FLOPs, and GPT-4 2e25 FLOPs. Naively, the scale-up factor is 67x. But if GPT-3’s level is attainable using less compute, the effective scale-up is bigger. I’m wondering how much bigger.)
IsoFLOP curves for dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1).
GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla’s compute optimal 20 tokens/parameter is approximately correct for GPT-3, it’s 10x undertrained. Interpolating from the above 30x overtraining example, a compute optimal model needs about 1.5e23 FLOPs to get the same perplexity.
(The effect from undertraining of GPT-3 turns out to be quite small, reducing effective compute by only 2x. Probably wasn’t worth mentioning compared to everything else about it that’s different from GPT-4.)
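To spell the arithmetic out: the first block checks the 30x-overtraining example directly from C ≈ 6ND, and the second extrapolates the compute penalty to GPT-3’s undertraining. The power-law form of the penalty is an extra assumption, fit through the single data point given above (30x off the optimal ratio costing 3x compute).

```python
import math

# The 30x-overtraining example: start compute optimal (N0 params, D0 = 20 * N0
# tokens), then use 10x the data with 1/3 the active parameters.  C ~ 6 N D.
N0, D0 = 1.0, 20.0                # arbitrary units
C0 = 6 * N0 * D0
N1, D1 = N0 / 3, 10 * D0
print(6 * N1 * D1 / C0)           # ~3.3x more compute
print((D1 / N1) / (D0 / N0))      # 30x overtrained

# GPT-3: 175B params, 300B tokens, 3e23 FLOPs, i.e. ~12x undertrained relative
# to 20 tokens/parameter (rounded to 10x above).  Assume the compute penalty
# for being r-times off the optimal ratio is a power law with penalty(30) == 3.
def penalty(r):
    k = math.log(3) / math.log(30)
    return r ** k

r_gpt3 = 20 / (300e9 / 175e9)
print(penalty(r_gpt3))            # ~2.2x
print(3e23 / penalty(r_gpt3))     # ~1.4e23 FLOPs, close to the ~1.5e23 estimate above
```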
in retrospect, we know from chinchilla that gpt3 allocated its compute too much to parameters as opposed to training tokens. so it’s not surprising that models since then are smaller. model size is a less fundamental measure of model cost than pretraining compute. from here on i’m going to assume that whenever you say size you meant to say compute.
obviously it is possible to train better models using the same amount of compute. one way to see this is that it is definitely possible to train worse models with the same compute, and it is implausible that the current model production methodology is the optimal one.
it is unknown how much compute the latest models were trained with, and therefore what compute efficiency win they obtain over gpt4. it is unknown how much more effective compute gpt4 used than gpt3. we can’t really make strong assumptions using public information about what kinds of compute efficiency improvements have been discovered by various labs at different points in time. therefore, we can’t really make any strong conclusions about whether the current models are not that much better than gpt4 because of (a) a shortage of compute, (b) a shortage of compute efficiency improvements, or (c) a diminishing return of capability wrt effective compute.
One possible answer is that we are in what one might call an “unhobbling overhang.”
Aschenbrenner uses the term “unhobbling” for changes that make existing model capabilities possible (or easier) for users to reliably access in practice.
His presentation emphasizes the role of unhobbling as yet another factor growing the stock of (practically accessible) capabilities over time. IIUC, he argues that better/bigger pretraining would produce such growth (to some extent) even without more work on unhobbling, but in fact we’re also getting better at unhobbling over time, which leads to even more growth.
That is, Aschenbrenner treats pretraining improvements and unhobbling as economic substitutes: you can improve “practically accessible capabilities” to the same extent by doing more of either one even in the absence of the other, and if you do both at once that’s even better.
However, one could also view the pair more like economic complements. Under this view, when you pretrain a model at “the next tier up,” you also need to do novel unhobbling research to “bring out” the new capabilities unlocked at that tier. If you only scale up, while re-using the unhobbling tech of yesteryear, most of the new capabilities will be hidden/inaccessible, and this will look to you like diminishing downstream returns to pretraining investment, even though the model is really getting smarter under the hood.
This could be true if, for instance, the new capabilities are fundamentally different in some way that older unhobbling techniques were not planned to reckon with. Which seems plausible IMO: if all you have is GPT-2 (much less GPT-1, or char-rnn, or...), you’re not going to invest a lot of effort into letting the model “use a computer” or combine modalities or do long-form reasoning or even be an HHH chatbot, because the model is kind of obviously too dumb to do these things usefully no matter how much help you give it.
(Relatedly, one could argue that fundamentally better capabilities tend to go hand in hand with tasks that operate on a longer horizon and involve richer interaction with the real world, and that this almost inevitably causes the appearance of “diminishing returns” in the interval between creating a model smart enough to perform some newly long/rich task and the point where the model has actually been tuned and scaffolded to do the task. If your new model is finally smart enough to “use a computer” via a screenshot/mouseclick interface, it’s probably also great at short/narrow tasks like NLI or whatever, but the benchmarks for those tasks were already maxed out by the last generation so you’re not going to see a measurable jump in anything until you build out “computer use” as a new feature.)
This puts a different spin on the two concurrent observations that (a) “frontier companies report ‘diminishing returns’ from pretraining” and (b) “frontier labs are investing in stuff like o1 and computer use.”
Under the “unhobbling is a substitute” view, (b) likely reflects an attempt to find something new to “patch the hole” introduced by (a).
But under the “unhobbling is a complement” view, (a) is instead simply a reflection of the fact that (b) is currently a work in progress: unhobbling is the limiting bottleneck right now, not pretraining compute, and the frontier labs are intensively working on removing this bottleneck so that their latest pretrained models can really shine.
(On an anecdotal/vibes level, this also agrees with my own experience when interacting with frontier LLMs. When I can’t get something done, these days I usually feel like the limiting factor is not the model’s “intelligence” – at least not only that, and not that in an obvious or uncomplicated way – but rather that I am running up against the limitations of the HHH assistant paradigm; the model feels intuitively smarter in principle than “the character it’s playing” is allowed to be in practice. See my comments here and here.)
That seems maybe right, in that I don’t see holes in your logic on LLM progression to date, off the top of my head.
It also lines up with a speculation I’ve always had. In theory LLMs are predictors, but in practice, are they pretty much imitators? If you’re imitating human language, you’re capped at reproducing human verbal intelligence (other modalities are not reproducing human thought, so they’re not capped; but they don’t contribute as much in practice without imitating human thought).
I’ve always suspected LLMs will plateau. Unfortunately, I see plenty of routes to improvement using runtime compute/CoT and continuous learning; both are central to human intelligence.
LLMs already have slightly-greater-than-human system 1 verbal intelligence, leaving some gaps where humans rely on other systems (e.g., visual imagination for tasks like tracking how many cars I have or tic-tac-toe). As we reproduce the systems that give humans system 2 abilities by skillfully iterating system 1, as o1 has started to do, they’ll be noticeably smarter than humans.
The difficulty of finding new routes forward in this scenario would produce a very slow takeoff. That might be a big benefit for alignment.
Yep, I think that’s basically the case: in practice, LLMs are mostly imitators.
@nostalgebraist makes an excellent point that eliciting any latent superhuman capabilities which bigger models might have is an art of its own, and that “just train chatbots” doesn’t exactly cut it for this task. Maybe that’s where some additional capabilities progress might still come from.
But the AI industry (both AGI labs and the menagerie of startups and open-source enthusiasts) has so far been either unwilling or unable to move past the chatbot paradigm.
(Also, I would tentatively guess that this type of progress is not existentially threatening. It’d yield us a bunch of nice software tools, not a lightcone-eating monstrosity.)
I agree that chatbot progress is probably not existentially threatening. But it’s all too short a leap from chatbots to chatbot-powered general agents. The labs have claimed to be willing and enthusiastic about moving to an agent paradigm. And I’m afraid that a proliferation of even weakly superhuman, or merely roughly parahuman, agents could be existentially threatening.
I spell out my logic for how short the leap might be from current chatbots to takeover-capable AGI agents in my argument for short timelines being quite possible. I do think we’ve still got a good shot at aligning that type of LLM agent AGI, since it’s a nearly best-case scenario. Even in o1, RL is really mostly used for making the model accurately follow instructions, which is at least roughly the ideal alignment goal of Corrigibility as Singular Target. Even if we lose faithful chain of thought and orgs don’t take alignment that seriously, I think those advantages of not really being a maximizer and having corrigibility might win out.
That, in combination with the slower takeoff, makes me tempted to believe it’s actually a good thing if we forge forward, even though I’m not at all confident that this will actually get us aligned AGI or good outcomes. I just don’t see a better realistic path.
One obvious hole would be that capabilities did not, in fact, plateau at the level of GPT-4.
I thought the argument was that progress has slowed down immensely. The softer form of this argument is that LLMs won’t plateau but progress will slow to such a crawl that other methods will surpass them. The arrival of o1 and o3 says this has already happened, at least in limited domains—and hybrid training methods and perhaps hybrid systems probably will proceed to surpass base LLMs in all domains.
There’s been incremental improvement and various quality-of-life features: more pleasant chatbot personas, tool use, multimodality, gradually better math/programming performance that makes the models useful to gradually bigger demographics, et cetera.
But it’s all incremental, no jumps like 2-to-3 or 3-to-4.
I see, thanks. Just to make sure I’m understanding you correctly, are you excluding the reasoning models, or are you saying there was no jump from GPT-4 to o3? (At first I thought you were excluding them in this comment, until I noticed the “gradually better math/programming performance.”)
I think the jump from GPT-4 to o3 represents non-incremental narrow progress, but only, at best, incremental general progress.
(It’s possible that o3 does “unlock” transfer learning, or that o4 will do that, etc., but we’ve seen no indication of that so far.)