Do we have reason to assume that MoE models will stretch to the entire rack memory capacity minus KV cache in batched inference? I ruminated earlier that Rubin Ultra could theoretically run 300T-parameter sparse models, but recent results suggest it's possible to squeeze much more out of far fewer parameters and to iterate much faster.
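Just to put rough numbers on it (my own back-of-envelope sketch; the rack capacity, KV-cache reservation, and FP4 weight format below are assumptions for illustration, not confirmed specs):

```python
# Back-of-envelope for the "rack memory minus KV cache" budget.
# All figures here are assumptions for illustration, not confirmed hardware specs.

rack_fast_memory_tb = 365.0   # assumed per-rack fast memory, TB
kv_cache_reserve_tb = 65.0    # assumed reservation for KV cache in batched serving
bytes_per_weight = 0.5        # FP4-quantized weights

weight_budget_bytes = (rack_fast_memory_tb - kv_cache_reserve_tb) * 1e12
max_total_params_t = weight_budget_bytes / bytes_per_weight / 1e12

print(f"Sparse model that fills the weight budget: ~{max_total_params_t:.0f}T parameters")
# ~600T under these assumptions; a 300T FP4 model would use only ~150 TB,
# leaving the rest of the rack for KV cache, activations, and overhead.
```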
This data on OpenAI's compute spend points to the same trend: https://epoch.ai/data-insights/openai-compute-spend
At the same time, some Chinese open-source models are scaling down their parameter counts. The most recent MiniMax M2 is much smaller than MiniMax M1, and Qwen3-Next-80B-A3B is close in performance to Qwen3-235B-A22B. I wouldn't be surprised if the same happened with DeepSeek-V4.
It's not that models won't grow, but it feels like intelligence density per parameter, and how sparse a model can get before quality degrades, will be tested in the search for maximum training and inference efficiency.
Thanks for the detailed response. If we assume that continued progress requires spending much more compute on RL than on pretraining, this could favor model sizes well below Chinchilla-optimal, since the Chinchilla trade-off only governs the pretraining stage and smaller models generate RL rollouts more cheaply. It's possible that future models will be trained with 99% RL compute and will employ some sort of continual-learning architecture.
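A toy calculation of that effect (my own sketch; the 1e27 FLOP budget and RL fractions are arbitrary, and the Chinchilla heuristics strictly apply only to dense pretraining):

```python
import math

# Toy illustration of why an RL-heavy compute split favors smaller models.
# Uses the rough Chinchilla heuristics C ~ 6*N*D and D ~ 20*N for the
# pretraining stage; the total budget and RL fractions are arbitrary assumptions.

total_compute_flop = 1e27
for rl_fraction in (0.0, 0.5, 0.9, 0.99):
    pretrain_flop = total_compute_flop * (1.0 - rl_fraction)
    # C = 6*N*D with D = 20*N  =>  C = 120*N^2  =>  N = sqrt(C / 120)
    n_opt = math.sqrt(pretrain_flop / 120.0)
    print(f"RL share {rl_fraction:.0%}: Chinchilla-optimal size ~{n_opt / 1e12:.1f}T params")
```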
I believe the number of active parameters will be determined by balancing compute and memory load, which will depend on the attention efficiency available at the time and the context length needed to hit agentic-ability targets. I don't know how sparse very large (>20T) models could be, but they could probably be extremely sparse (maybe 0.1% active or less, with most of the volume taken up by the router).
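For scale (again, the sizes here are just illustrative assumptions):

```python
# Toy arithmetic for the sparsity guess above; the sizes are assumptions.

total_params = 20e12     # a hypothetical >20T-parameter MoE
active_fraction = 0.001  # 0.1% of weights active per token

active_params = total_params * active_fraction
train_flops_per_token = 6 * active_params  # rough ~6*N_active FLOP per trained token

print(f"Active parameters per token: ~{active_params / 1e9:.0f}B")
print(f"Training FLOPs per token: ~{train_flops_per_token:.1e}")
# ~20B active and ~1.2e11 FLOP per token: compute comparable to a 20B dense
# model while total weights stay at 20T.
```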
I think it's possible that current and planned AI compute designs are not even close to optimal for training methods that may be developed within a year or so. Assuming something like DeepSeek's DSA (sparse attention) is relatively close to maximum attention efficiency, and RL trajectories produced by agents stretch to many millions of tokens, it may turn out that HBM is finally too slow and that moving to SRAM is the only way forward.
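A rough sense of the bandwidth pressure (every figure below is an assumption I picked to show the shape of the problem, not a measurement):

```python
# Rough decode-bandwidth sketch for a multi-million-token agent trajectory.
# Every figure is an assumption picked only to show the shape of the problem.

active_params      = 50e9    # active parameters per generated token
bytes_per_weight   = 0.5     # FP4 weights
kv_bytes_per_token = 70e3    # assumed compressed KV footprint per cached token
context_tokens     = 5e6     # multi-million-token RL/agent trajectory
attended_fraction  = 0.02    # share of the cache a sparse-attention scheme actually reads
hbm_bandwidth      = 8e12    # assumed aggregate HBM bandwidth, bytes/s

weight_bytes = active_params * bytes_per_weight
kv_bytes = context_tokens * kv_bytes_per_token * attended_fraction
per_token_bytes = weight_bytes + kv_bytes

print(f"Per decoded token: ~{weight_bytes / 1e9:.0f} GB weights + ~{kv_bytes / 1e9:.0f} GB KV")
print(f"Bandwidth-bound ceiling: ~{hbm_bandwidth / per_token_bytes:.0f} tokens/s")
```

Even a few hundred tokens per second is slow when a single rollout runs to millions of tokens and has to be generated serially (batching amortizes the weight reads but not the per-sequence KV reads), which, if these numbers are even roughly right, is where the pull toward SRAM-class memory would come from.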
There's also a completely different consideration. For example, inclusionAI/Ring-1T has 1T total and 50B active parameters, and its creators claim it can reach IMO silver using a multi-agent framework, with plans to train it further toward gold, all with only 50B active parameters.
As you said, we're probably entering compute ranges that enable training models with many trillions of active parameters (assuming RL compute stays in roughly one-to-one proportion to pretraining). The question is whether the pattern density of our environment is currently large enough to make use of it.