It’s interesting that reasoning models were invented right around the time that we seemed to be reaching the end of the data/compute curve with base models. I think foundation models have had slower progress in the past two years compared to the previous two. (Though, it’s hard to say as the public now has little access to frontier base models.) But what would have happened if reasoning models had not arisen?
Slower progress overall?
Foundation models much more advanced than now?
Some other angle identified instead of reasoning?
This seems like a key uncertainty. In one mental model, we barely avoided slowdown by inventing reasoning models. So maybe reasoning models will plateau and progress will slow. In another mental model, as soon as one angle reaches diminishing returns, we immediately invent another angle, and progress will continue indefinitely.
Pretraining scaling might be relatively slow, but it hasn’t been slowing down. Pretraining compute growth continues, and will continue at a relatively steady clip through 2022-2028 (in part thanks to Nvidia’s bet on FP8 in Rubin, and with TPUv7 and Rubin Ultra unlocking the giant MoE models that pretraining scale will be asking for). Data for pretraining shouldn’t be a problem until after 2026, and there are probably more data-efficient ways to set up pretraining once the amount of data becomes a constraint, including training for more epochs on repeated data. The AI companies knew this issue would come up years in advance, and almost certainly prepared the necessary algorithmic improvements.
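The “more epochs on repeated data” point can be made concrete. One published fit (Muennighoff et al. 2023, on data-constrained scaling) models repeated data as adding value with exponentially diminishing returns; the sketch below paraphrases that functional form, with the decay constant of ~15 taken as an assumption rather than an exact figure:

```python
import math

def effective_data(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Rough effective dataset size when repeating data for multiple epochs.
    Extra passes add value with exponentially diminishing returns; r_star
    (~15, an assumed constant loosely following Muennighoff et al. 2023)
    sets how quickly repetition stops helping."""
    repeats = epochs - 1  # passes over the data beyond the first
    return unique_tokens * (1 + r_star * (1 - math.exp(-repeats / r_star)))

# With 10T unique tokens, 4 epochs are "worth" roughly 37T effective tokens:
print(effective_data(10e12, 4) / 1e12)  # ~37.2
```

On this kind of model, a few epochs of repetition are almost as good as fresh data, which is why a 2026-ish data wall need not bite immediately.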
There was an impression that pretraining was slowing down by the end of 2024, since there had been essentially no pretraining scaling in publicly released models since the original GPT-4 from Mar 2023. In retrospect, 2023-2024 seem extremely slow in contrast with 2025, but 2025’s progress is scaling of RLVR, a totally different thing. The first models that credibly take advantage of pretraining scaling since GPT-4 for improving capabilities rather than for reducing costs (and that incorporate all the RLVR techniques contemporary to their release) are Opus 4.5 and Gemini 3 Pro, which only came out in late 2025. Since they were likely pretrained on only 2024 levels of compute, that’s maybe 8-15 times more compute than GPT-4, while the 2028-2030 models trained on 2 GW Rubin systems might use 900 times more compute than GPT-4 in pretraining. After that, pretraining will slow down, at least if none of the AI companies start pulling in a trillion dollars per year in revenue. That might well happen, in which case there’s one more step in compute scaling (after 2028-2029) similar to what’s happening every 2 years in 2022-2026, though it’ll take a few years to scale the necessary semiconductor production to take that step with a single system using chips of the same generation, so perhaps only by the mid-2030s.
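The multipliers above imply a fairly steady growth rate. A quick sanity-check of the arithmetic, taking GPT-4’s pretraining compute as an outside estimate (~2e25 FLOPs, an assumption, not a known figure) and the 8-15x and 900x multipliers from the text:

```python
# Rough arithmetic behind the scaling multipliers in the comment above.
GPT4_FLOPS = 2e25  # assumption: a commonly cited public estimate

late_2025_models = (8 * GPT4_FLOPS, 15 * GPT4_FLOPS)  # ~1.6e26 - 3e26 FLOPs
rubin_2gw_models = 900 * GPT4_FLOPS                   # ~1.8e28 FLOPs

# Implied average growth if the 900x accrues over the ~6 years 2022-2028:
years = 6
growth_per_year = 900 ** (1 / years)
print(f"{growth_per_year:.1f}x per year")  # ~3.1x per year
```

A ~3x/year pretraining compute trend through 2028 is consistent with the "steady clip" framing, and with why continuing past a 2 GW system requires revenue to grow by another order of magnitude.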
I think foundation models have had slower progress in the past two years compared to the previous two.
At the time there was a scoop by The Information, and then two or three by other outlets, to the effect that OpenAI, Google and Anthropic had all had disappointing internal results with continued Chinchilla scaling. Specifically, OpenAI’s huge “Orion” model was planned to be GPT-5, but was then not released because it showed disappointingly small improvements on benchmarks. It was only half-released much later, and temporarily, under the name GPT-4.5. I don’t know what the failed Anthropic/Google training runs were, but I remember there was a Gemini 2.0 Flash but no Pro, suggesting they only released a distilled 2.0 model because the full model wasn’t worth it.
My guess is that this might have been a data quality issue: if you increase the model scale without increasing data quality past the GPT-4 level, the model might still achieve better perplexity, but not because it gets smarter; rather, it starts memorizing unpredictable facts from the training data, because there is not enough high-IQ text in the training pool to drive loss lower through greater model intelligence.
The reason would be that pretraining on text is a kind of imitation learning: if the model is trained only on text produced by human intelligence, it only ever learns to imitate text of human intelligence, which limits how smart it can get.
There is also an academic debate on whether RLVR reasoning training can make a model fundamentally more intelligent in the way scaling pretraining can, or whether RLVR just assigns higher probability to reasoning traces the base model could already produce via pass@k sampling. See e.g. this paper.
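For readers unfamiliar with the pass@k metric this debate hinges on: it is usually computed with the unbiased estimator from Chen et al. (2021, the Codex paper), which the sketch below implements. The numbers in the usage example are illustrative, not from any particular model:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021): given n samples per
    problem of which c are correct, the probability that at least one of
    k randomly chosen samples is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# A base model solving a problem in 5 of 100 samples already "knows" the
# answer at pass@100, even though pass@1 is only 5%:
print(pass_at_k(100, 5, 1))    # 0.05
print(pass_at_k(100, 5, 100))  # 1.0
```

The skeptical reading of RLVR is that it mostly moves pass@1 up toward the base model’s pass@k ceiling, rather than raising the ceiling itself.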
On the one hand, “reasoning” seems like an obvious angle to try, so the counterfactual timeline in which it never happened feels very implausible to me. On the other hand, it took some time to take off, so maybe it’s not as trivial/obvious as it seems. (I don’t recall what exactly OpenAI did to get o1 to work.)
This is the “stacked S-curves” effect often seen as (usually more mundane) technologies mature. It’s perhaps slightly unusual that it’s more pronounced and “discrete” right now (relatively few innovations driving large amounts of progress).
The other angles are probably already out there, but haven’t been given the chance to shine while the current paradigm can still be sufficiently leveraged, so I’m not very hopeful about progress stalling by itself.
Do we have any measures of the progress in the “quality” of base models? (As measured in terms of nats / perplexity?)
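Public loss figures for frontier base models are scarce, but the quantities themselves are straightforwardly related. A small sketch of the standard conversions between cross-entropy in nats per token, bits per token, and perplexity:

```python
import math

# Conversions between the standard measures of base-model "quality":
# cross-entropy loss in nats/token, bits/token, and perplexity.
def perplexity(nats_per_token: float) -> float:
    return math.exp(nats_per_token)

def bits_per_token(nats_per_token: float) -> float:
    return nats_per_token / math.log(2)

# e.g. a loss drop from 2.0 to 1.8 nats/token:
print(perplexity(2.0), perplexity(1.8))  # ~7.39 -> ~6.05
```

Because perplexity is exponential in the loss, seemingly small nats-per-token improvements correspond to meaningful quality gains, which is part of why raw loss curves understate progress to casual readers.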
See also https://www.lesswrong.com/posts/6eguLiC2QP399GrQE/leon-lang-s-shortform?commentId=i89TRPGjkbicrWmJj