Total params plus the total KV cache across all concurrent requests is what the cost of output tokens scales with, so there is reason to keep total params down, but little reason to make them much smaller than the HBM of the whole scale-up world, because then they are much smaller than KV cache and stop influencing the cost. And for the most capable models the fraction of input tokens on OpenRouter is not as extreme as for Sonnet 4 (88% for Gemini 2.5 Pro, 92% for GPT-5; though 97% for Opus 4.1, probably due to high cost). So this won’t be a factor that motivates fewer active params, as it is with the 8-chip servers and possibly in part with the 6-8 TB systems. Also, 2025 Google pretraining compute could be significantly greater than 100K H100s (maybe 2-4 datacenters of 100K TPUv6e each, which have the same FLOP/s as 200-400K H100s; pretraining models that are too large for TPUv6e is fine, it’s only inference and RLVR that don’t fit there). So the compute optimal number of active params could increase to 1.0-1.5T (if my 120 tokens/param estimate is in the ballpark). This asks for at least 4-6T total params, but at least 8-12T at 1:8 sparsity might be more appropriate for a premium model (this would be Gemini 3 Ultra). That’s only about 20% of a pod’s HBM (if in FP8), so maybe even 15-20T (at which point the contribution to the cost of output tokens becomes significant).
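A minimal back-of-envelope sketch of the arithmetic above. The 120 tokens/param ratio, the 200-400K H100-equivalents, and the 1:8 sparsity are from the estimate above; the ~1e15 dense BF16 FLOP/s per H100-equivalent, ~40% utilization, ~100 days of pretraining, and the 256-chip TPUv7 pod at 192 GB HBM per chip are my own rough assumptions for what "pod HBM" means here.

```python
h100_flops = 1e15          # ~dense BF16 FLOP/s per H100-equivalent (rough assumption)
mfu = 0.4                  # assumed compute utilization
seconds = 100 * 86400      # ~100 days of pretraining (assumption)
tokens_per_param = 120     # tokens per active param, from the estimate above

pod_hbm_bytes = 256 * 192e9    # assumed 256-chip TPUv7 pod, 192 GB HBM per chip

for n_chips in (200_000, 400_000):   # "same FLOP/s as 200-400K H100s"
    C = n_chips * h100_flops * mfu * seconds          # total pretraining FLOPs
    n_active = (C / (6 * tokens_per_param)) ** 0.5    # solve C = 6 * N * (120 * N)
    total_params = 8 * n_active                       # total params at 1:8 sparsity
    weight_bytes = total_params                       # FP8: 1 byte per param
    print(f"{n_chips // 1000}K H100-equiv: active ~{n_active / 1e12:.1f}T, "
          f"total (1:8) ~{total_params / 1e12:.0f}T, "
          f"weights ~{weight_bytes / 1e12:.0f} TB "
          f"~{weight_bytes / pod_hbm_bytes:.0%} of pod HBM")
```

With these assumptions this prints roughly 1.0-1.4T active params, 8-11T total params at 1:8 sparsity, and 16-23% of a pod’s HBM, consistent with the ranges above.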
I’ve only recently realized that the reason there is no Gemini 2 Ultra might be that they don’t have enough inference capacity for models with very large total params, with TPUv6e only having 8 TB of HBM per pod and TPUv5p either outright insufficient in number or not available to spare, since those are needed for other things. So it’s probably not evidence of Google having decided to use less than what they have, as I previously thought. And as TPUv7 changes what they have, they might use it to do more than what they did with Gemini 2. Though if the TPUv7 buildout isn’t sufficiently far along in 2025, RLVR and inference will have to wait until later in 2026 (in the meantime, TPUv5p might help to start on RLVR).
Here are a couple of my recent relevant posts (both slightly outdated; in particular see this comment, and the note on Gemini 2 Ultra in another comment under this quick take). Though in this quick take I’m mostly discussing total params count and HBM capacity per scale-up world rather than compute: how they constrain 2025 AIs beyond compute (so that even 2024 levels of compute fail to find efficient use), and how in 2026 these constraints become less strict.