Pretraining scaling might be relatively slow, but it hasn't been slowing down. Pretraining compute growth continues, and should continue at a relatively steady clip through 2022-2028 (in part thanks to Nvidia's bet on FP8 in Rubin, and with TPUv7 and Rubin Ultra unlocking the giant MoE models that pretraining scale will be asking for). Data for pretraining shouldn't be a problem until after 2026; once the amount of data becomes a constraint, there are probably more data-efficient ways to set up pretraining, including training for more epochs on repeated data. The AI companies knew this issue would come up years in advance, and almost certainly prepared the necessary algorithmic improvements.
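On the repeated-epochs point, there is at least one published attempt to quantify it: Muennighoff et al. (2023), "Scaling Data-Constrained Language Models", fit the value of repeated data as decaying exponentially in the number of repetitions, roughly D' = U + U·R*·(1 − e^(−R/R*)) with R* ≈ 15, so the first few extra epochs are nearly as good as fresh data. A minimal sketch of that fit (the functional form and the constant come from that paper, not from anything above; treat them as assumptions):

```python
import math

def effective_data(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Effective fresh-token count when training for `epochs` passes over
    `unique_tokens` tokens, using the exponential-decay fit from
    Muennighoff et al. (2023). r_star ~ 15 is their fitted decay constant,
    used here as an assumption."""
    repetitions = epochs - 1  # R = passes beyond the first
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repetitions / r_star)))

# e.g. 4 epochs over 10T unique tokens behave like ~37T fresh tokens under this fit:
print(effective_data(10e12, 4) / 1e12)
```

Under this fit the returns taper off hard past several epochs (the multiplier saturates at 1 + R*), which is consistent with repetition buying a couple more years of scaling rather than removing the data constraint entirely.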
There was an impression that pretraining was slowing down by the end of 2024, since there had been essentially no pretraining scaling in publicly released models since the original GPT-4 from March 2023. In retrospect, 2023-2024 does seem extremely slow in contrast with 2025, but 2025 is scaling of RLVR, a totally different thing. The first models that credibly take advantage of pretraining scaling since GPT-4 to improve capabilities rather than to reduce costs (while incorporating all the RLVR techniques contemporary to their release) are Opus 4.5 and Gemini 3 Pro, which only came out in late 2025. And since they were likely pretrained on only 2024 levels of compute, that's maybe 8-15 times more compute than GPT-4, while the 2028-2030 models trained on 2 GW Rubin systems might use 900 times more compute than GPT-4 in pretraining. After that, pretraining will slow down, at least if none of the AI companies start pulling in a trillion dollars per year in revenue. That might well happen, in which case there's one more step of compute scaling (after 2028-2029), similar to the steps happening every 2 years in 2022-2026, though it'll take a few years to scale up the necessary semiconductor production to take that step with a single system built from chips of the same generation, so perhaps only by the mid-2030s.
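To make the multiples concrete, here's a back-of-the-envelope sketch. Every input is an assumption (GPT-4 at ~2e25 FLOPs is a common external estimate; the per-chip power, throughput, utilization, and run length for a 2 GW Rubin system are illustrative placeholders, not Rubin specs), but it shows how a number in the ballpark of the ~900x figure falls out of a power budget:

```python
# All inputs are assumptions chosen to illustrate the arithmetic, not claims
# about actual Rubin hardware or any particular training run.
GPT4_FLOPS = 2e25        # common external estimate of GPT-4 pretraining compute
POWER_W = 2e9            # 2 GW training system
W_PER_CHIP = 2_000       # all-in watts per accelerator (chip + cooling + networking)
FLOPS_PER_CHIP = 5e15    # assumed dense FP8 throughput per accelerator
UTILIZATION = 0.4        # assumed fraction of peak sustained over the run
RUN_SECONDS = 1e7        # roughly four months of wall-clock training

chips = POWER_W / W_PER_CHIP                              # ~1e6 accelerators
total_flops = chips * FLOPS_PER_CHIP * UTILIZATION * RUN_SECONDS
print(f"{total_flops:.1e} FLOPs, {total_flops / GPT4_FLOPS:.0f}x GPT-4")
# -> 2.0e+28 FLOPs, 1000x GPT-4, in the ballpark of the ~900x above
```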
Do we have any measures of progress in the "quality" of base models (as measured in nats / perplexity)?
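For concreteness, the quantity the question is asking about is mean next-token loss in nats on a fixed held-out corpus, with perplexity as its exponential. A minimal sketch of the measurement using the HuggingFace transformers API (the checkpoint name and the text are placeholders; the point is what gets measured, not this particular model):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder checkpoint; swap in the base model being compared
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

text = "held-out evaluation text goes here"  # placeholder corpus
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # With labels=input_ids, the model returns mean next-token cross-entropy;
    # PyTorch computes it with natural log, i.e. in nats per token.
    loss = model(ids, labels=ids).loss.item()

print(f"{loss:.3f} nats/token, perplexity {math.exp(loss):.1f}")
```

The catch with comparing such numbers across model generations is that they are only comparable on the same held-out corpus with the same tokenization, which is part of why labs rarely publish them.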