I think foundation models have made slower progress in the past two years than in the two years before that.
At the time there was a scoop by The Information, and then two or three follow-ups by other outlets, to the effect that OpenAI, Google and Anthropic all had disappointing internal results with continued Chinchilla scaling. Specifically, OpenAI's huge “Orion” model was planned to be GPT-5, but was then not released because it showed disappointingly small improvements on benchmarks. It was only half-released much later, for a while, under the name GPT-4.5. I don’t know what the failed Anthropic/Google training runs were, but I remember there was a Gemini 2.0 Flash but no 2.0 Pro, suggesting they only released a distilled 2.0 model because the full model wasn’t worth releasing.
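For reference, "Chinchilla scaling" here means the compute-optimal scaling recipe of Hoffmann et al. (2022), which fits pretraining loss as a function of parameter count and token count. A minimal sketch, using the fitted constants reported in that paper (the exact values are from the paper; treat the formula as a descriptive fit, not a law):

```python
# Chinchilla parametric loss fit (Hoffmann et al., 2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# where N = parameters, D = training tokens.
# E is the fitted irreducible loss; A, B, alpha, beta are fitted constants.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def chinchilla_loss(N: float, D: float) -> float:
    """Predicted pretraining loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

# Scaling both N and D lowers predicted loss, but with diminishing returns
# as the two power-law terms shrink toward the irreducible term E.
print(chinchilla_loss(7e9, 1.4e12))   # ~7B model, 1.4T tokens
print(chinchilla_loss(70e9, 1.4e12))  # ~70B model, same data
```

The diminishing-returns shape of this fit is one way to read the reported disappointments: past a certain scale, each extra order of magnitude of compute buys visibly less loss reduction.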
My guess is that this might have been a data-quality issue: if you increase model scale without increasing data quality past the GPT-4 level, the model may still achieve better perplexity, but not because it gets smarter; rather, it starts memorizing unpredictable facts from the training data. There is simply not enough high-IQ text in the training pool for greater model intelligence to drive the loss still lower.
The reason would be that pretraining on text is a kind of imitation learning: if the model is only ever trained on text produced by humans, it only learns to imitate text of human-level intelligence, which limits how smart it can get.
There is also an academic debate about whether RLVR reasoning training can make a model fundamentally more intelligent in the way scaling pretraining can, or whether RLVR merely assigns higher probability to reasoning traces the base model could already produce, as measured by pass@k sampling. See e.g. this paper.
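The pass@k metric mentioned above is usually computed with the unbiased estimator from the Codex paper (Chen et al., 2021): draw n samples, count the c correct ones, and estimate the probability that at least one of k samples would succeed. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn per problem
    c: number of those samples that were correct
    k: budget being evaluated
    Returns P(at least one of k samples is correct).
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    # 1 - P(all k sampled answers are drawn from the n-c incorrect ones)
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct.
print(pass_at_k(10, 3, 1))  # pass@1
print(pass_at_k(10, 3, 5))  # pass@5
```

The debate then hinges on a comparison like pass@1 of the RLVR model versus pass@k (large k) of the base model: if the base model already solves the problem somewhere in its sample distribution, RLVR may only be concentrating probability mass rather than adding capability.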