anaguma comments on Vladimir_Nesov’s Shortform

anaguma 7 Sep 2025 21:10 UTC
3 points
0
If there are enough of these built by then, Gemini 3 could include the largest model of 2026 (by total params count).
What do you estimate the total params count would be if so?
- Vladimir_Nesov 7 Sep 2025 22:22 UTC
  6 points
  0
  Parent
  Total params plus the total KV cache for all requests multiplies the cost of output tokens, so there is reason to keep it down, but little reason to make it much smaller than the whole scale-up world, because then it’s much smaller than KV cache and stops influencing the cost. And for the most capable models the fraction of input tokens on OpenRouter is not as extreme as for Sonnet 4 (88% for Gemini 2.5 Pro, 92% for GPT-5; though 97% for Opus 4.1, probably due to high cost). So it won’t be a factor that motivates fewer active params as with the 8-chip servers and possibly in part with the 6-8 TB systems. Also, 2025 Google pretraining compute could be significantly greater than 100K H100s (maybe 2-4 100K TPUv6e datacenters, which have the same FLOP/s as 200-400K H100s; pretraining of models that are too large using TPUv6e is fine, just not inference or RLVR). So the compute optimal number of active params could increase to 1.0-1.5T (if my 120 tokens/param estimate is in the ballpark). This asks for at least 4-6T total params, but at least 8-12T for 1:8 sparsity might be more appropriate for a premium model (this would be Gemini 3 Ultra). Which is only 20% of the pod HBM (if in FP8), so maybe even 15-20T (at which point the contribution to the cost of output tokens becomes significant).
  
  I’ve only recently realized that the reason there is no Gemini 2 Ultra might be because they don’t have enough inference capacity for overly large total params models, with TPUv6e only having 8 TB of HBM per pod and TPUv5p either outright insufficient in number or not enough to spare, since they are needed for other things. So it’s probably not evidence of Google having made a decision to use less than what they have, as I previously thought. And as TPUv7 changes what they have, they might use it to do more than what they did with Gemini 2. Though if the buildout for TPUv7 won’t yet be sufficiently finished in 2025, RLVR and inference will have to wait until later in 2026 (in the meantime, TPUv5p might help to start on RLVR).
  What links here?
  - Vladimir_Nesov's comment on Vladimir_Nesov’s Shortform by Vladimir_Nesov (7 Sep 2025 23:01 UTC; 10 points)