The idea that Chinchilla scaling might be slowing comes from the fact that we’ve seen a bunch of delays and disappointments in the next generation of frontier models.
GPT-4.5 was expensive, and it got yanked. We’re not hearing rumors about how amazing GPT-5 is. Grok 3 scaled up and saw some improvement, but nothing that gave it an overwhelming advantage. Gemini 2.5 is solid but not transformative.
Nearly all the gains we’ve seen recently come from reasoning, which is comparatively easy to train into models. For example, DeepScaleR is a 1.5B-parameter local model that is hilariously awful at everything but high school math. But a $4,500 fine-tune was enough to make it competitive with frontier models in that one area. Qwen3’s small reasoning models are surprisingly strong. (Try feeding the 32B or 30B-A3B models high school homework problems. Use Gemma3 to OCR worksheets and Qwen3 to solve them. You could just about take a scanner, a Python control script, and a printer, and build a 100% local automated homework machine.)
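To make that concrete, here’s roughly what the control script could look like. This is a minimal sketch, assuming a local Ollama install with gemma3 and qwen3 models already pulled; the model tags, the scan path, and the prompts are placeholders, not a tested pipeline.

```python
# Rough sketch of the "local homework machine": OCR a scanned worksheet with a
# vision model, then hand the extracted problems to a local reasoning model.
# Assumes a local Ollama install with gemma3 and qwen3 pulled; model tags,
# paths, and prompts below are illustrative placeholders.
import ollama

SCAN_PATH = "scans/worksheet_001.png"  # wherever the scanner drops its output

# Step 1: use Gemma 3 (vision-capable) to transcribe the worksheet.
ocr = ollama.chat(
    model="gemma3",
    messages=[{
        "role": "user",
        "content": "Transcribe every problem on this worksheet, one per line.",
        "images": [SCAN_PATH],
    }],
)
problems = ocr["message"]["content"]

# Step 2: hand the transcription to Qwen 3 to actually work the problems.
solution = ollama.chat(
    model="qwen3:30b-a3b",
    messages=[{
        "role": "user",
        "content": f"Solve each of these problems, showing your work:\n\n{problems}",
    }],
)
print(solution["message"]["content"])  # or send this to a printer instead
```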
I’ve heard different kinds of speculation about why Chinchilla scaling might be struggling:
Maybe we’re running low on good training data?
Maybe the resulting models are too large to be affordable?
Maybe the training runs are so expensive that it’s getting hard to run enough experiments to debug problems?
Maybe this stuff is just an S-curve, and it’s finally starting to flatten? Most technological S-curves outside of machine learning do eventually slow.
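(For reference, by “Chinchilla scaling” I mean the compute-optimal training recipe from Hoffmann et al. (2022). A rough sketch of what the fitted law says, using the paper’s approximate published constants; treat the numbers as illustrative rather than exact.)

```python
# What "Chinchilla scaling" predicts, for concreteness. Constants are the
# approximate point estimates published in Hoffmann et al. (2022).
def chinchilla_loss(params: float, tokens: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / params**alpha + B / tokens**beta

# Compute-optimal allocation is roughly "tokens ~ 20 x parameters",
# with training FLOPs C ~ 6 * N * D.
N = 70e9       # 70B parameters (Chinchilla's own size)
D = 20 * N     # ~1.4T tokens
C = 6 * N * D  # ~6e23 FLOPs
print(chinchilla_loss(N, D), C)
```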
LLM control is frequently analogized to nuclear non-proliferation. But from what various experts and semi-experts have told me, building fission weapons is actually pretty easy. In fact, most good university engineering departments could apparently do it. Simplified, low-yield designs are even easier. But what’s harder to get in any quantity is enriched U-235 (or a substitute?). Most of the routes to enrichment are supposedly hard to hide. Because fissile material is somewhat easier to control, nuclear non-proliferation is possible.
Chinchilla scaling is similarly hard to hide. You need a big building full of a lot of expensive GPUs. If governments cared enough, they could find anyone relying on scaling laws to train the equivalent of GPT-5 or GPT-6. If you somehow got the US, China and Europe scared enough, you could shut down further scaling. If smaller countries defected, you could physically destroy data centers or their supporting power generation (just like countries sometimes threaten to do to uranium enrichment operations).
This is why “reasoning” models were such a nasty shock for me. They showed that relatively inexpensive RL could upgrade existing models with very real new capabilities and the ability to handle multi-step tasks more robustly.
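To give a sense of how cheap this end of the spectrum is: the DeepScaleR-style recipe mentioned above is essentially RL against a verifiable reward (did the final answer match?) on top of a small existing model. Here is a minimal sketch, assuming a recent version of Hugging Face’s trl with GRPOTrainer and a hypothetical math dataset with plain-text “prompt” and “answer” columns; none of the specifics below are the actual DeepScaleR configuration.

```python
# Minimal sketch of verifiable-reward RL of the kind DeepScaleR-style projects use.
# Assumes a recent Hugging Face `trl` that ships GRPOTrainer; the dataset id and
# reward details are illustrative placeholders, not the real recipe.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


def correctness_reward(completions, answer, **kwargs):
    """Reward 1.0 if the final boxed answer matches the reference, else 0.0."""
    rewards = []
    for completion, ref in zip(completions, answer):
        predicted = completion.split("\\boxed{")[-1].rstrip("}").strip()  # crude extraction
        rewards.append(1.0 if predicted == str(ref).strip() else 0.0)
    return rewards


# Hypothetical math dataset with "prompt" and "answer" columns.
dataset = load_dataset("some-org/math-problems", split="train")

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # small base, as in such projects
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="grpo-math", max_completion_length=2048),
    train_dataset=dataset,
)
trainer.train()
```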
Some estimates claim that training Grok 3 cost $3 billion or more. If AI non-proliferation means preventing $30 billion or $300 billion training runs, that’s probably theoretically feasible (at least in a world where powerful people fear AGI badly enough). But if AI non-proliferation means preventing $4,500 fine-tunes by random researchers (which is apparently all it takes to get primitive “reasoning”), that’s a much stickier situation.
So, if, like Yudkowsky, you have a nasty suspicion that “If anyone builds this, everyone dies” (seriously, go preorder his book[1]), then we need to consider that AGI might arrive via a route other than Chinchilla scaling. And in that case, non-proliferation might be much harder than joint US/China treaties. I don’t have any good answers for this case. But I agree with OP that we need to include it as a branch in planning scenarios. And in those scenarios, mid-tier open-weight models like Qwen are potentially significant, either as a base for fine-tuning in dangerous directions, or as evidence that some non-US labs making 32B-parameter models are highly capable.
[1] https://www.lesswrong.com/posts/iNsy7MsbodCyNTwKs/eliezer-and-i-wrote-a-book-if-anyone-builds-it-everyone-dies
What is your opinion on the recent developments in LLMs? I feel the last 9 months since your comment was made have shown they are not slowing down. I don’t think the recent METR evals show superexponential growth, since the tasks are saturated at the high end (80% success is more in line with my model of reality), but they are still on the steeper post-2024 exponential that came with the introduction of reasoning models. My not-super-expert understanding is that labs are still massively scaling up, and that this scaling accounts for the majority of the improvements, though I’m not discounting improved algorithmic efficiency.
One of my main near-term concerns is that they get good at certain scientific research tasks, like designing novel viruses that can be made by ordering proteins online, before they become “agentic.” We’re already seeing novel mathematical and physics research, at the low end of course, and experts say this is mostly because of time constraints on human researchers, but in my opinion that will probably change over the next 2 years, so that they will actually solve problems that humans were already working on.
My argument at the time was that Chinchilla scaling might be slowing down, but that there might be cheaper ways to improve LLMs. Unfortunately, I can’t evaluate the accuracy of my prediction, because I don’t know (for example) how much model size changed between the Claude 3.7 and Claude 4.5 models. Did their parameter count go up? Did they increase their pre-training by an order of magnitude? Or did they just continue to lean into reasoning, more RL, and better training data?
But yes, I agree that Claude Code has improved dramatically in the last 12 months, and I doubt that it has stopped.
Now we’re in a weird place:
I see Claude Opus 4.5 and 4.6 as a pretty convincing proof of concept for AGI. These models still suffer from significant limitations, but it’s clear to me that they do think, and that they already exceed median human intelligence on certain kinds of tasks.
But at the same time, I don’t think that current techniques will allow removing several of those significant limitations.[1]
I don’t want to get into specifics here, because I personally fear that rigorous alignment is impossible. And so every year that nobody makes those breakthroughs is (frankly) one more year that the people I love get to live in a world where humans actually control our fate.