There were multiple reports claiming that scaling base LLM pretraining yielded unexpected diminishing returns for several new frontier models in 2024, like OpenAI’s Orion, which was apparently planned to be GPT-5. They mention a lack of high quality training data; if that is the cause, it would not be surprising, since the Chinchilla scaling law only applies to perplexity, not necessarily to practical (e.g. benchmark) performance. Base language models perform a form of imitation learning, and it seems that you don’t get performance that is significantly smarter than the humans who wrote the text in the pretraining data, even if perplexity keeps improving.
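For reference, the Chinchilla scaling law is a fit for loss only. Here is a minimal sketch using the approximate constants reported by Hoffmann et al. (2022); benchmark performance appears nowhere in the formula:

```python
import math

# Chinchilla parametric loss fit (Hoffmann et al. 2022), approximate constants:
#   L(N, D) = E + A / N^alpha + B / D^beta
# where N = parameters and D = training tokens. It predicts loss, not benchmarks.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted per-token cross-entropy loss (in nats)."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def perplexity(loss_nats: float) -> float:
    """Perplexity is exp(loss); it improves whenever loss does."""
    return math.exp(loss_nats)

# Example: roughly Chinchilla itself (70B parameters, 1.4T tokens).
loss = chinchilla_loss(70e9, 1.4e12)
print(f"predicted loss ~ {loss:.2f} nats, perplexity ~ {perplexity(loss):.1f}")
```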
Since pretraining compute has in the past been a major bottleneck for frontier LLM performance, a reduced effect of pretraining means that algorithmic progress within a lab is now more important than it was two years ago. This would mean the relative importance of having a lot of compute has gone down, while the relative importance of having highly capable AI researchers (who can improve model performance through better AI architectures or training procedures) has gone up. And the ability of a lab’s AI researchers seems to depend much less on available money than its compute resources do. That would explain why e.g. Microsoft or Apple don’t have highly competitive models despite large financial resources, and why xAI’s Grok 3 isn’t very far beyond DeepSeek’s R1 despite a vastly greater compute budget.
Now it seems possible that this changes in the future, e.g. when performance starts to depend strongly on inference compute (i.e. not just logarithmically), or when pretraining switches from primarily text to primarily sensory data (like video), which wouldn’t be bottlenecked by imitation learning on human-written text. Another possibility is that pretraining on synthetic LLM outputs, like CoTs, could provide the necessary superhuman text for the pretraining data. But none of this is currently the case, as far as I can tell.
Pretraining on a $150bn system in 2028 gives 150x the compute of Grok 3 (which seems to be a 3e26 FLOPs model). We haven’t yet seen what happens if DeepSeek-V3 methods are used in pretraining on the $5bn system that trained Grok 3 in 2025 (which would be about 100x DeepSeek-V3’s compute), or on a $20bn system in 2026 (a further 8x in FLOPs).
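Spelled out, the arithmetic above (the Grok 3 figure and the system-to-compute ratios are the estimates stated here, not reported numbers):

```python
# Rough arithmetic for the compute comparisons above. All inputs are the
# assumptions stated in the text (estimates, not official figures).
grok3 = 3e26                       # assumed Grok 3 pretraining compute, in FLOPs
deepseek_v3 = grok3 / 100          # a V3-scale run is roughly 1/100 of that

scenarios = {
    "DeepSeek-V3 (assumed)": deepseek_v3,
    "Grok 3 on $5bn system, 2025 (assumed)": grok3,       # ~100x a V3-scale run
    "$20bn system, 2026 (further 8x)": 8 * grok3,
    "$150bn system, 2028 (150x Grok 3)": 150 * grok3,
}
for name, flops in scenarios.items():
    print(f"{name}: ~{flops:.1e} FLOPs")
```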
Chinchilla scaling law only applies to perplexity, not necessarily to practical (e.g. benchmark) performance
I think perplexity is a better measure of general intelligence than any legible benchmark. There are rumors that in some settings R1-like methods only started showing signs of life for GPT-4 level models where exactly the same thing didn’t work for weaker models[1]. Something else might first start working with the kind of perplexity that a competent lab can concoct in a 5e27 FLOPs model, even if it can later be adopted for weaker models.
lack of high quality training data
High quality training data is an example of a compute multiplier that doesn’t scale, and the usual story is that there are many algorithmic advancements with the same character: they help at 1e21 FLOPs but become mostly useless at 1e24 FLOPs. The distinction between perplexity and benchmarks in measuring compute multipliers (keeping the dataset unchanged) might be a good proxy for predicting which is which.
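To make “a compute multiplier that doesn’t scale” concrete, here is a minimal toy sketch (the loss curve and the constant are invented for illustration, not fitted to any real model): if an improvement acts like a fixed amount of extra effective compute rather than a constant factor, it looks large at 1e21 FLOPs and nearly vanishes at 1e24 FLOPs.

```python
# Toy sketch only: made-up numbers, not from any real model.
# The "compute multiplier" of an improvement at compute C is the factor m such
# that the baseline would need m*C FLOPs to match the improved run's loss.

def loss_baseline(c):
    # invented power-law loss curve over compute C (FLOPs)
    return 1.7 + 12.0 * c ** -0.05

# Suppose the improvement (e.g. a curated high-quality dataset) behaves like a
# fixed amount of extra effective compute C0, rather than a constant factor.
C0 = 1e22

def loss_improved(c):
    return loss_baseline(c + C0)

def compute_multiplier(c):
    # Matching loss_improved(c) requires baseline compute c + C0, so m = 1 + C0/c.
    return (c + C0) / c

for c in (1e21, 1e24):
    print(f"at {c:.0e} FLOPs: loss {loss_improved(c):.3f} vs {loss_baseline(c):.3f}, "
          f"multiplier ~ {compute_multiplier(c):.2f}x")
```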
you don’t get performance that is significantly smarter than the humans who wrote the text in the pretraining data
Prediction of details can make use of arbitrarily high levels of capability, vastly exceeding that of the authors of the predicted text. What the token prediction objective gives you is generality and grounding in the world, even if it seems to be inefficient compared to imagined currently-unavailable alternatives.
[1] Before 2024, only OpenAI (and briefly Google) had a GPT-4 level model, while in 2024 GPT-4 level models became ubiquitous. This might explain how a series of reproductions of o1-like long reasoning performance followed in quick succession, in a way that doesn’t significantly rely on secrets leaking from OpenAI.
Chinchilla scaling law only applies to perplexity, not necessarily to practical (e.g. benchmark) performance
I think perplexity is a better measure of general intelligence than any legible benchmark. There are rumors that in some settings R1-like methods only started showing signs of life for GPT-4 level models where exactly the same thing didn’t work for weaker models[1]. Something else might first start working with the kind of perplexity that a competent lab can concoct in a 5e27 FLOPs model, even if it can later be adopted for weaker models.
But GPT-4 didn’t just have better perplexity than previous models, it also had substantially better downstream performance. To me it seems more likely that the better downstream performance is responsible for the model being well-suited for reasoning RL, since this is what we would intuitively describe as its degree of “intelligence”, and intelligence seems important when teaching a model how to reason, while it’s not clear what perplexity itself would be useful for. (One could probably test this by training a GPT-4 scale model with similar perplexity but on bad training data, such that it only reaches the downstream performance of older models. Then I predict that it would be as bad as those older models when doing reasoning RL. But of course this test is far too expensive to carry out.)
you don’t get performance that is significantly smarter than the humans who wrote the text in the pretraining data
Prediction of details can make use of arbitrarily high levels of capability, vastly exceeding that of the authors of the predicted text. What the token prediction objective gives you is generality and grounding in the world, even if it seems to be inefficient compared to imagined currently-unavailable alternatives.
You could train a model on text typed by little children, such that the model is able to competently imitate a child typing, but the resulting model’s performance wouldn’t significantly exceed that of a child, even though the model uses a lot of compute. Training on text doesn’t really give a lot of direct grounding in the world, because text represents real-world data that has been compressed and filtered by human brains, and their intelligence acts as a fundamental bottleneck. Imagine you are a natural scientist, but instead of making direct observations of the world, you are locked in a room and limited to listening to what a little kid, who has seen the natural world, happens to say about it. After listening for a while, at some point you wouldn’t learn much more about the world.