I fear your concerns are very real. I’ve spent a lot of time running experiments on the mid-sized Qwen3 models (32B, 30B A3B), and they are strongly competitive with frontier models up through gpt-4o-1120. The latter writes better and has more personality, but the former are more likely to pass your high school exams.
What happened here? Well, two things. First, the Alibaba Group is competent and knows what it’s doing. But more importantly, it turned out that “reasoning” was surprisingly easy, and everyone cloned it within a few months, sometimes on budgets of less than $5,000. And a well-built reasoning model can be much stronger than GPT-4o on complex tasks.
As long as we relied on the Chinchilla scaling laws to improve frontier models, every frontier model cost far more than the last. This made AI possible to control, at least in theory. But Chinchilla scaling finally seems to be slowing, and further advancements will likely come from unexpected directions.
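To put rough numbers on “every frontier model cost far more than the last”: here is a back-of-the-envelope sketch using the Chinchilla rules of thumb (a compute-optimal run trains on roughly 20 tokens per parameter and burns about 6 × parameters × tokens FLOPs). The price per FLOP is an illustrative assumption, not a measurement.

```python
# Rough sketch of Chinchilla-style cost growth. Under compute-optimal
# training, tokens D scale with parameters N (D ~ 20 * N), and training
# compute is roughly C = 6 * N * D FLOPs. The price per FLOP is an
# assumed, order-of-magnitude cloud figure.

USD_PER_FLOP = 3e-18  # illustrative assumption

def chinchilla_cost(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate cost of a compute-optimal training run."""
    flops = 6 * n_params * (tokens_per_param * n_params)
    return flops * USD_PER_FLOP

for n in (10e9, 70e9, 300e9, 1e12):
    print(f"{n/1e9:6.0f}B params: ~${chinchilla_cost(n):,.0f}")
```

Because parameters and tokens grow together, cost rises roughly with the square of the parameter count, which is why each scaling-law generation is so much more expensive than the one before it.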
And some of those further advancements may turn out to be like reasoning, something that can be trained into models for $5,000. Or perhaps it will require fresh base models, but the underlying technique will be obvious enough that any serious lab can replicate it.
In other words: We need to consider the scenario where one good paper might make it obvious how to train a 200B-parameter model into a weak AGI.
I think the only way we survive this is a global halt with teeth. Maybe Eliezer’s book will convince some people. Maybe we’ll get a nasty public scare that makes politicians freak out. I strongly suspect we will not be able to align an ASI any more than we can fly to the moon by flapping our arms.
The idea that Chinchilla scaling might be slowing comes from the string of delays and disappointments we’ve seen in the next generation of frontier models.
GPT-4.5 was expensive and it got yanked. We’re not hearing rumors about how amazing GPT-5 is. Grok 3 scaled up and saw some improvement, but nothing that gave it an overwhelming advantage. Gemini 2.5 is solid but not transformative.
Nearly all the gains we’ve seen recently come from reasoning, which is comparatively easy to train into models. For example, DeepScaleR is a 1.5B-parameter local model that is hilariously awful at everything but high school math. But a $4,500 fine-tune was enough to make it competitive with frontier models in that one area. Qwen3’s small reasoning models are surprisingly strong. (Try feeding 32B or 30B A3B high school homework problems. Use Gemma3 to OCR worksheets and Qwen3 to solve them. You could just about take a scanner, a Python control script, and a printer, and build a 100% local automated homework machine.)
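Here is a minimal sketch of that homework pipeline, assuming a local OpenAI-compatible server (e.g. Ollama) on its default port serving a gemma3 vision model and a qwen3 reasoning model; the model tags, prompts, and file name are illustrative, and the scanner/printer glue is omitted:

```python
# Minimal sketch of the "homework machine" pipeline, under these assumptions:
# a local OpenAI-compatible server (e.g. Ollama) is running on port 11434 and
# serving models tagged "gemma3" (vision/OCR) and "qwen3" (reasoning). Model
# names, prompts, and file paths are illustrative only.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")


def ocr_worksheet(image_path: str) -> str:
    """Ask the vision model to transcribe a scanned worksheet into plain text."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gemma3",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe every problem on this worksheet as plain text."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


def solve_problems(problems: str) -> str:
    """Hand the transcribed problems to the reasoning model."""
    resp = client.chat.completions.create(
        model="qwen3",
        messages=[{
            "role": "user",
            "content": f"Solve these problems, showing your work:\n\n{problems}",
        }],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    print(solve_problems(ocr_worksheet("scan.png")))
```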
I’ve heard several kinds of speculation about why Chinchilla scaling might be struggling:
Maybe we’re running low on good training data?
Maybe the resulting models are too large to be affordable?
Maybe the training runs are so expensive that it’s getting hard to run enough experiments to debug problems?
Maybe this stuff is just an S-curve, and it’s finally starting to flatten? Most technological S-curves outside of machine learning do eventually slow.
LLM control is frequently analogized to nuclear non-proliferation. But from what various experts and semi-experts have told me, building fission weapons is actually pretty easy; most good university engineering departments could apparently do it, and simplified, low-yield designs are even easier. What’s hard to get in any quantity is enriched U-235 (or a substitute?). Most of the routes to enrichment are supposedly hard to hide. Because fissile material is easier to control than weapon designs, nuclear non-proliferation is possible.
Chinchilla scaling is similarly hard to hide. You need a big building full of a lot of expensive GPUs. If governments cared enough, they could find anyone relying on scaling laws to train the equivalent of GPT-5 or GPT-6. If you somehow got the US, China and Europe scared enough, you could shut down further scaling. If smaller countries defected, you could physically destroy data centers or their supporting power generation (just like countries sometimes threaten to do to uranium enrichment operations).
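To see why, here is some rough arithmetic; every number in it is an illustrative assumption (the FLOP budget of a GPT-5/6-class run, per-GPU throughput, utilization, power draw), not a claim about any particular lab:

```python
# Back-of-the-envelope sketch of why a scaling-law frontier run is hard to
# hide. Every input is an illustrative assumption: total training FLOPs for
# a GPT-5/6-class run, per-GPU throughput, utilization, and power draw.

TRAIN_FLOPS   = 1e26    # assumed total FLOPs for a next-generation run
GPU_FLOPS     = 1e15    # assumed peak FLOP/s per accelerator (H100-class)
UTILIZATION   = 0.4     # assumed realized fraction of peak throughput
RUN_DAYS      = 120     # assumed wall-clock training time
WATTS_PER_GPU = 1200    # assumed power per GPU incl. cooling and networking

seconds = RUN_DAYS * 24 * 3600
gpus = TRAIN_FLOPS / (GPU_FLOPS * UTILIZATION * seconds)
megawatts = gpus * WATTS_PER_GPU / 1e6

print(f"~{gpus:,.0f} GPUs for {RUN_DAYS} days, drawing ~{megawatts:.0f} MW")
```

A site like that looks far more like a uranium enrichment plant than like a $4,500 fine-tune.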
This is why “reasoning” models were such a nasty shock for me. They showed that relatively inexpensive RL could upgrade existing models with very real new capabilities, including markedly more robust handling of multi-step tasks.
Some estimates claim that training Grok 3 cost $3 billion or more. If AI non-proliferation means preventing $30 billion or $300 billion training runs, that’s probably theoretically feasible (at least in a world where powerful people fear AGI badly enough). But if AI non-proliferation means preventing $4,500 fine-tunes by random researchers (which is apparently all it takes to add primitive “reasoning”), that’s a much stickier situation.
So, if, like Yudkowsky, you have a nasty suspicion that “If anyone builds this, everyone dies” (seriously, go preorder his book[1]), then we need to consider that AGI might arrive via a route other than Chinchilla scaling. And in that case, non-proliferation might require far more than joint US/China treaties. I don’t have any good answers for this case. But I agree with OP that we need to include it as a branch in planning scenarios. And in those scenarios, mid-tier open-weight models like Qwen are potentially significant, either as a base for fine-tuning in dangerous directions, or as evidence that some non-US labs building 32B-parameter models are already highly capable.
[1] https://www.lesswrong.com/posts/iNsy7MsbodCyNTwKs/eliezer-and-i-wrote-a-book-if-anyone-builds-it-everyone-dies