Do we have reason to assume that MoE models will stretch to fill the entire rack memory capacity minus KV cache in batched inference? I speculated earlier that Rubin Ultra could theoretically run 300T-parameter sparse models, but it increasingly looks like it’s possible to squeeze much more out of far fewer parameters and iterate much faster.
This data about compute spend at OpenAI suggests this trend: https://epoch.ai/data-insights/openai-compute-spend
At the same time, some Chinese open-source models are scaling down their parameter counts. The recent MiniMax M2 is much smaller than MiniMax M1, and Qwen3-Next-80B-A3B comes close in performance to Qwen3-235B-A22B. I wouldn’t be surprised if the same happened with DeepSeek-V4.
It’s not that models won’t grow, but it feels like intelligence density per parameter, and how sparse a model can get before quality degrades, will keep being tested in the search for maximum training and inference efficiency.
Experiments on smaller models (and their important uses in production) will continue, but the reasons they should continue don’t affect the feasibility of there also being larger models. Currently, though, there are reasons why larger models strain the feasibility of inference and RLVR, so they aren’t as good as they could be and cost too much to use. Also, a lot of usage seems to be in input tokens (Sonnet 4.5 via OpenRouter processes 98% of its tokens as input tokens), so the unit economics of input tokens remains important, and that cost is driven by the number of active params, a reason to keep active params down even when they are far from being directly constrained by inference hardware or training compute.
Prefill (input tokens) and pretraining are mostly affected by the number of active params; adding more total params on top doesn’t make them worse (while improving model quality). For generation (decoding, output tokens) and RLVR, what matters is the time to stream the total params and the KV cache through the compute dies (HBM in use divided by HBM bandwidth), as well as the latency of a full pass through the model to get to the next token (which doesn’t matter for prefill). So you don’t want too many scale-up worlds to be involved, or it would take too much additional time to move between them, and you don’t care too much as long as the total amount of data (total params plus KV cache) doesn’t change significantly. So if you are already using 1-2 scale-up worlds (8-chip servers for older Nvidia chips, 72-chip racks for GB200/GB300 NVL72, not-too-large pods for TPUs), and ~half of their HBM is KV cache, you don’t lose too much from filling more of the other half with total params.
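To make that concrete, here is a back-of-the-envelope sketch of this cost model in Python; the rack-scale numbers (HBM capacity, KV cache share, weight bytes, bandwidth) are placeholders I made up for illustration, not spec-sheet values:

```python
# Back-of-the-envelope sketch of the prefill vs. decode cost model above.
# All hardware numbers are illustrative assumptions, not measured or official specs.

def prefill_flops_per_token(active_params: float) -> float:
    # Prefill (and the pretraining forward pass) scales with ACTIVE params:
    # roughly 2 FLOPs per active param per token.
    return 2 * active_params

def decode_step_seconds(total_param_bytes: float, kv_cache_bytes: float,
                        hbm_bandwidth_bytes_per_s: float) -> float:
    # One decoding step streams all in-use weights plus the KV cache through the
    # compute dies, so its time is roughly (bytes resident in HBM) / (HBM bandwidth).
    return (total_param_bytes + kv_cache_bytes) / hbm_bandwidth_bytes_per_s

# Hypothetical rack-scale placeholders: ~20 TB of HBM with ~half used by KV cache,
# ~8 TB of weights (e.g. ~8T params at 1 byte/param), ~600 TB/s aggregate bandwidth.
kv_cache = 10e12      # bytes
weights = 8e12        # bytes
bandwidth = 600e12    # bytes/s

print(f"prefill: ~{prefill_flops_per_token(1e12):.1e} FLOPs per input token at 1T active params")
step = decode_step_seconds(weights, kv_cache, bandwidth)
print(f"decode: ~{step*1e3:.0f} ms per step -> ~{1/step:.0f} tokens/s for each sequence in the batch")
# Doubling `weights` while the KV cache stays put only grows the decode numerator by ~44%,
# which is the "filling the other half of HBM with total params is relatively cheap" argument.
```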
It’s not a quantitative estimate, as the number of scale-up worlds and the fractions used up by KV cache and total params could vary, but when HBM per scale-up world goes up 10x, this suggests that total param counts might also go up 10x, all else equal. And the reason they are likely to actually go there is that even at 5e26 FLOPs (100K H100s, 150 MW), the compute-optimal number of active params is already about 1T. So if modern models (other than GPT-4.5, Opus 4, and possibly Gemini 2.5 Pro) have less than 1T total params, they are being constrained by hardware in a way that’s not about the number of training FLOPs. If this constraint is lifted, the larger models are likely to make use of it.
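As a rough sanity check on the ~1T figure, using the standard C ≈ 6·N·D training-FLOPs approximation; the tokens-per-active-param ratio is my assumption, since none is stated here:

```python
# Sanity check on the compute-optimal active-param count quoted above.
# C ~ 6 * N * D training FLOPs with D = r * N tokens, so N = sqrt(C / (6 * r)).
# The tokens-per-active-param ratio r is an assumption, not from the source.
C = 5e26    # training FLOPs (a 100K-H100, 150 MW class run)
for r in (20, 40, 80):
    N = (C / (6 * r)) ** 0.5
    print(f"r = {r:>2}: compute-optimal active params ~ {N/1e12:.1f}T")
# r = 20 (dense Chinchilla) gives ~2T, r = 80 gives ~1T; either way it is around
# the 1T scale, i.e. well above the active-param counts of most current models.
```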
For the Chinese models, the total-to-active ratio (sparsity) is already very high, but they don’t have enough compute to make good use of many more active params in pretraining. So we are observing this phenomenon of total params filling the available HBM even while the number of active params remains low. With 1 GW datacenter sites, about 3T active params become compute-optimal, so at least 1T active params will probably be in use for the larger models. That asks for up to ~30T total params, so hardware will likely still be constraining total params, but it won’t be constraining the 1T-3T active params themselves anymore.
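Extending the same sketch to a 1 GW site; the compute multiplier and the total-to-active ratio below are illustrative assumptions:

```python
# Same formula scaled up to a 1 GW site; the numbers are illustrative assumptions.
C_1gw = 4.5e27    # ~9x the 5e26 run above (more power plus better FLOPs/W), assumed
r = 80            # tokens per active param, same assumption as before
N_active = (C_1gw / (6 * r)) ** 0.5
total = 10 * N_active    # 10x total-to-active ratio (current sparse MoEs run ~10x-30x)
print(f"compute-optimal active ~ {N_active/1e12:.1f}T, total at 10x sparsity ~ {total/1e12:.0f}T")
# ~3T active at even a 10x ratio already gives ~30T total, which is where HBM
# capacity (rather than training FLOPs) becomes the binding constraint again.
```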
Thanks for the detailed response. If we assume that continued progress requires using much more compute on RL than pretraining, this could favor model sizes much smaller than Chinchilla optimal. It’s possible that future models will be trained with 99% RL and will employ some sort of continuous learning architecture.
I believe the number of active parameters will be determined by balancing the load on compute and memory, which will depend on the current level of attention efficiency and the context lengths necessary to hit agent ability targets. I don’t know how sparse very large (>20T) models could be, but they could probably be extremely sparse (maybe 0.1% or less active, where most of the volume is taken by a router).
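One way to make “similar load on compute and memory” concrete is a roofline-style check: decode stays memory-bound while the batch’s arithmetic intensity sits below the chip’s FLOPs-to-bandwidth ratio. A minimal sketch, with placeholder hardware and model numbers rather than any specific part’s specs:

```python
# Roofline-style balance sketch: compute load vs. memory traffic per decode step.
# All hardware and model numbers are illustrative placeholders.
peak_flops = 2e15    # FLOP/s of the accelerator (assumed)
hbm_bw = 8e12        # bytes/s of HBM bandwidth (assumed)
ridge = peak_flops / hbm_bw    # FLOPs per byte needed to keep the compute dies busy

def decode_intensity(active_params, total_params, kv_bytes_per_seq, batch,
                     bytes_per_param=1):    # 1 byte/param assumes ~8-bit weights
    flops = 2 * active_params * batch       # ~2 FLOPs per active param per generated token
    bytes_moved = total_params * bytes_per_param + kv_bytes_per_seq * batch
    return flops / bytes_moved

intensity = decode_intensity(active_params=50e9, total_params=1e12,
                             kv_bytes_per_seq=2e9, batch=64)
print(f"ridge ~{ridge:.0f} FLOPs/byte, batch intensity ~{intensity:.0f} FLOPs/byte")
# If intensity << ridge, decode is memory-bound: extra active params are nearly free
# in step time, while better attention efficiency (a smaller KV cache) or larger
# batches push the intensity up toward the balance point.
```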
I think it’s possible that current and planned AI compute designs are not even close to optimal for training methods that may be developed within a year or so. Assuming something like DeepSeek’s DSA (sparse attention) is already relatively close to maximum efficiency, and RL trajectories produced by agents stretch to many millions of tokens, it may turn out that HBM is finally too slow, and using SRAM will be the only way forward.
There’s also a completely different consideration. For example, inclusionAI/Ring-1T has 1T total and 50B active parameters, and its creators claim it can achieve an IMO silver medal using a multi-agent framework, with plans to train it further toward gold. And that’s with only 50B active params.
As you said, we’re probably entering compute ranges that enable training models with many trillions of active parameters (assuming RL will be in roughly one-to-one proportion to pretraining). The question is whether the pattern density in our environment is currently large enough to make use of that.
Compute optimality makes sense for RL as much as for pretraining. A model that’s too large won’t see much data, and a model that’s too small won’t be able to learn well even from the correspondingly larger amount of data. So it’s a quantitative question of where compute optimality for RLVR happens to land (compared to pretraining). The papers are still hashing out stable/scalable RL training, rather than the tradeoffs of RL training constrained by fixed total compute (with model size specifically varying).
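As a toy illustration of that tradeoff under a fixed RL compute budget (the budget and the per-token cost model below are assumptions; real RLVR also pays for environment prefill, verifier calls, and so on):

```python
# Toy sketch of the size-vs-data tradeoff for RL at a fixed compute budget.
# Assumes rollout generation plus the training update cost ~6 FLOPs per active
# param per generated token; budget and cost model are illustrative only.
C_rl = 1e26    # fixed RL compute budget (assumed)
for active in (30e9, 300e9, 3e12):
    tokens = C_rl / (6 * active)
    print(f"active = {active/1e9:>6,.0f}B -> ~{tokens/1e12:,.0f}T rollout tokens")
# A larger model learns more per token but sees far fewer rollouts at fixed compute;
# where the optimum lands for RLVR relative to pretraining is the open quantitative question.
```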