Crusoe/OpenAI Abilene campus might come online in Feb-Jun 2026. Crusoe CEO said during RAISE Summit 2025 (which took place on 8-9 Jul 2025) that the 6 buildings of phase 2 will “be coming online” in “just over 200 days” (at 7:03 during a panel discussion). If this means 230 days, that’s end of Feb 2026. If he literally means “coming online”, then the campus becomes available at that time. If he instead means that this is when the last of the 8 buildings from both phases will be ready for compute hardware to be installed, then installing it takes at least 3-4 more months (judging by xAI’s Colossus), possibly pushing to May-Jun 2026.
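As a quick check on the timeline arithmetic (a minimal sketch; reading “just over 200 days” as ~230 days and a ~3-4 month installation period are my assumptions):

```python
from datetime import date, timedelta

# "Just over 200 days" counted from the RAISE Summit panel (8-9 Jul 2025),
# read here as ~230 days (assumption).
summit = date(2025, 7, 9)
coming_online = summit + timedelta(days=230)
print(coming_online)                         # 2026-02-24, i.e. end of Feb 2026

# If that date only marks readiness to install compute hardware, add ~3-4
# months of installation (judging by xAI's Colossus) before it's usable.
print(coming_online + timedelta(days=105))   # 2026-06-09, i.e. May-Jun 2026
```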
This is plausibly the first 400K chip system in GB200/GB300 NVL72 racks (about 900 MW), which is 10x 100K H100s of 2024 in FLOP/s and 12x an 8-chip H200 server in HBM per scale-up world (14 TB for GB200 NVL72), making models 10x larger in total params feasible to inference or train with a lot of RLVR. Currently only Google plausibly has comparable compute, with their Trillium (TPUv6e) systems, whose 256-chip pods (scale-up worlds) offer 8 TB of HBM (generally available since Dec 2024 in 100K chip systems). The older TPUv5p from 2023 has even larger pods, but it’s unclear if they have enough of them to, for example, inference Gemini 2.5 Pro for all users. And Anthropic has Trainium 2 Ultra systems with 6 TB of HBM. Currently they probably have only about 400K chips, which became available recently (months after TPUv6e), but by next year they might get significantly more.
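A sketch of where the HBM-per-scale-up-world comparison comes from; the per-chip HBM capacities and the 64-chip Trainium 2 Ultra configuration are my assumptions about the public specs:

```python
# HBM per scale-up world, in TB; per-chip capacities are assumed ballpark specs
worlds = {
    "8x H200 server":              8 * 141 / 1000,    # ~1.1 TB
    "GB200 NVL72 rack":            72 * 192 / 1000,   # ~13.8 TB (the "14 TB")
    "GB300 NVL72 rack":            72 * 288 / 1000,   # ~20.7 TB (the "20 TB")
    "TPUv6e 256-chip pod":         256 * 32 / 1000,   # ~8.2 TB
    "Trainium 2 Ultra (64 chips)": 64 * 96 / 1000,    # ~6.1 TB
}
for name, tb in worlds.items():
    print(f"{name}: ~{tb:.1f} TB")

print(worlds["GB200 NVL72 rack"] / worlds["8x H200 server"])  # ~12x
```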
2025 Frontier Model Sizes
This weakly predicts that GPT-5-thinking (and Grok 4) is a smaller model (1-2T total params) running on older hardware (~H200s, 1.1 TB), Gemini 2.5 Pro might be a 3-5T total params model (TPUv6e, 8 TB), and Opus 4 might be a 2-4T total params model (Trainium 2 Ultra, 6 TB). I’m assuming that the recent frontier models targeting the older 8-chip servers had to be too big to fit in one scale-up world in order to capture at least some of the capabilities that the available pretraining compute in principle enables, but the constraint is no longer as onerous with the newer systems, so models targeting them will likely just fit in one scale-up world rather than lose efficiency by needing more.
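A minimal sketch of how HBM per scale-up world maps to a feasible total params count; the FP8 weights and the fraction of HBM left for KV cache are made-up assumptions, and models above the bound (like GPT-5-thinking on H200 servers, per the guess above) need several scale-up worlds per instance:

```python
def feasible_total_params_t(hbm_tb, bytes_per_param=1.0, kv_fraction=0.5):
    """Rough upper bound on total params (in trillions) per scale-up world.

    bytes_per_param=1.0 assumes FP8 weights; kv_fraction is the share of HBM
    assumed to be left for KV cache and activations (made-up assumption).
    """
    return hbm_tb * (1 - kv_fraction) / bytes_per_param

for name, hbm in [("8x H200", 1.1), ("Trainium 2 Ultra", 6),
                  ("TPUv6e pod", 8), ("GB200 NVL72", 14), ("GB300 NVL72", 20)]:
    print(f"{name}: ~{feasible_total_params_t(hbm):.1f}T total params")
```

With these assumptions the bounds land roughly in the middle of the ranges guessed above (~3T for Trainium 2 Ultra, ~4T for a TPUv6e pod, ~7T for GB200 NVL72).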
The compute optimal size for pretraining with 100K H100s of 2024 might be about 800B active params (at 120 tokens/param, 3x the dense model’s 40 tokens/param to account for 1:8 sparsity). That’s probably way too much with 1 TB of HBM per server (since MoE wants at least 4x more total params, and inference gets slower and more expensive if too many scale-up worlds are needed per model), but might be OK for 6-8 TB of HBM per scale-up world, and so Opus 4 and Gemini 2.5 Pro might also have more active params than GPT-5-thinking. With GB200 NVL72 (14 TB), models with 4-8T total params become feasible, so there is less reason to keep the number of active params below the compute optimal level. And then GB300 NVL72 has 20 TB of HBM, which is plausibly what the remaining 6 buildings of phase 2 of the Abilene campus will host.
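A back-of-envelope for the ~800B active params figure, using C = 6·N·D with D = 120·N; the per-chip throughput, utilization, and training duration are my assumptions:

```python
# Compute-optimal active params: C = 6 * N * D with D = tokens_per_param * N,
# so N = sqrt(C / (6 * tokens_per_param)).
h100_flops = 1e15                        # assumed ~1e15 dense BF16 FLOP/s per H100
chips, mfu, days = 100_000, 0.4, 100     # assumed fleet, utilization, duration

C = chips * h100_flops * mfu * days * 86_400   # total pretraining FLOPs
tokens_per_param = 120                   # 3x the dense 40 tokens/param for 1:8 sparsity
N = (C / (6 * tokens_per_param)) ** 0.5
print(f"~{N / 1e9:.0f}B active params")  # ~700B, in the ~800B ballpark
```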
On the other hand, most tokens are input tokens (98% of OpenRouter Sonnet 4 tokens are input tokens), so reducing the number of active params is very important for model providers, and even if Gemini 2.5 Pro has 5T total params, it might still have significantly fewer active params than the pretraining compute optimal ~800B. For example, at 1:32 sparsity even 5T total params only ask for 160B active params.
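The sparsity arithmetic as a one-liner sketch (the 5T figure is the hypothetical from above):

```python
total_params_b = 5000                 # hypothetical 5T total params
for sparsity in (8, 32):
    print(f"1:{sparsity} -> {total_params_b / sparsity:.0f}B active params")
# 1:8 -> 625B, 1:32 -> 156B (~160B)
```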
Largest Models of 2025-2026
So only Opus 4 is somewhat likely to have a compute optimal number of active params, due to its very high price and the contrast with the already capable Sonnet 4 (they might’ve only had access to about 50K H100s when pretraining Opus 4, which is 5x fewer FLOP/s than 400K Trainium 2 chips). And GPT-4.5 probably has a similar number of active params (plausibly a bit more, since they had at least 100K H100s), but we still haven’t gotten a thinking version, so its capabilities can’t be properly observed. And plausibly it wasn’t trained with enough RLVR to count, due to lack of availability of GB200 NVL72. By now, Opus 4.1 has plausibly had enough time with Trainium 2 Ultra available to train with pretraining-scale RLVR (or this might happen a bit later), and similarly for GPT-4.5 (with GB200 NVL72). But for GPT-4.5 there might be insufficient compute to inference it without reducing demand a lot by setting uncomfortable prices or rate limits, and as a result a thinking model with pretraining-scale RLVR might not exist yet, at least in a product-ready form. This might take until well into 2026 to change, after phase 2 of the Abilene campus is ready (and presumably buildouts by other cloud providers that OpenAI might use, which might be a bit earlier, since inference doesn’t have much use for particularly giant datacenter campuses, just enough capacity in total to serve all users). If so, this is when we’ll see the first GPT-4.5 sized pretraining-scale RLVR trained model from OpenAI, though by that time the plausibly similarly sized Opus 4 would already be considerably more mature.
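A sketch of the “5x fewer FLOP/s” comparison from the start of this paragraph; both per-chip throughput figures are assumed ballparks, not exact specs:

```python
h100_bf16 = 1.0e15     # assumed ~1e15 dense BF16 FLOP/s per H100
trn2_bf16 = 0.65e15    # assumed ~0.65e15 dense BF16 FLOP/s per Trainium 2

opus4_pretrain_flops = 50_000 * h100_bf16    # ~50K H100s (assumed)
trn2_fleet_flops     = 400_000 * trn2_bf16   # ~400K Trainium 2 chips

print(trn2_fleet_flops / opus4_pretrain_flops)   # ~5x
```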
Then, there is Gemini 3, which will probably come out in early 2026. The next generation TPU is Ironwood (TPUv7), which supports 9,216-chip pods, but even 256-chip pods have 50 TB of HBM per pod. If there are enough of these built by then, Gemini 3 could include the largest model of 2026 (by total params count).
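The Ironwood pod arithmetic, with the per-chip HBM capacity as an assumption:

```python
ironwood_hbm_gb = 192                    # assumed HBM per TPUv7 (Ironwood) chip
print(256 * ironwood_hbm_gb / 1000)      # ~49 TB per 256-chip pod (the "50 TB")
print(9_216 * ironwood_hbm_gb / 1e6)     # ~1.8 PB across a full 9,216-chip pod
```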
A post going over how much compute each frontier AI lab has will likely be very helpful.
Here’s a couple of my recent relevant posts (both slightly outdated, in particular see this comment, and the note on Gemini 2 Ultra in another comment under this quick take). Though in this quick take, I’m mostly discussing total params count and HBM capacity per scale-up world rather than compute: how HBM is constraining 2025 AIs beyond compute (so that even 2024 levels of compute fail to find efficient use), and how in 2026 these constraints become less strict.
What do you estimate the total params count would be if so?
Total params plus the total KV cache for all requests scales the cost of output tokens, so there is reason to keep total params down, but little reason to make them much smaller than the whole scale-up world’s HBM, because at that point they are much smaller than the KV cache and stop influencing the cost. And for the most capable models the fraction of input tokens on OpenRouter is not as extreme as for Sonnet 4 (88% for Gemini 2.5 Pro, 92% for GPT-5; though 97% for Opus 4.1, probably due to high cost). So it won’t be as strong a factor motivating fewer active params as it was with the 8-chip servers, and possibly in part with the 6-8 TB systems. Also, 2025 Google pretraining compute could be significantly greater than 100K H100s (maybe 2-4 datacenters of 100K TPUv6e each, which have the same FLOP/s as 200-400K H100s; pretraining models that are too large for the pods is fine on TPUv6e, just not inference or RLVR). So the compute optimal number of active params could increase to 1.0-1.5T (if my 120 tokens/param estimate is in the ballpark). This asks for at least 4-6T total params, but at least 8-12T at 1:8 sparsity might be more appropriate for a premium model (this would be Gemini 3 Ultra). That’s only about 20% of the pod HBM (if in FP8), so maybe even 15-20T (at which point the contribution to the cost of output tokens becomes significant).
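Putting these numbers together in one sketch, reusing the earlier assumptions (per-chip throughput, utilization, duration) and treating FP8 weights as 1 byte/param:

```python
h100_flops, mfu, days, tokens_per_param = 1e15, 0.4, 100, 120   # assumptions
pod_hbm_tb = 49                         # ~50 TB Ironwood 256-chip pod

for h100_equiv in (200_000, 400_000):   # 2-4 datacenters of 100K TPUv6e
    C = h100_equiv * h100_flops * mfu * days * 86_400
    active = (C / (6 * tokens_per_param)) ** 0.5
    total_t = 8 * active / 1e12         # total params (T) at 1:8 sparsity
    print(f"{h100_equiv // 1000}K H100-equiv: ~{active / 1e12:.1f}T active, "
          f"~{total_t:.0f}T total at 1:8, "
          f"~{100 * total_t / pod_hbm_tb:.0f}% of pod HBM in FP8")
```

Under these assumptions the output lands near the 1.0-1.5T active and 8-12T total figures, at roughly 15-25% of the pod’s HBM.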
I’ve only recently realized that the reason there is no Gemini 2 Ultra might be that they don’t have enough inference capacity for models with too many total params, with TPUv6e only having 8 TB of HBM per pod, and TPUv5p either outright insufficient in number or without enough to spare, since they are needed for other things. So it’s probably not evidence of Google having made a decision to use less than what they have, as I previously thought. And as TPUv7 changes what they have, they might use it to do more than they did with Gemini 2. Though if the TPUv7 buildout isn’t sufficiently finished in 2025, RLVR and inference will have to wait until later in 2026 (in the meantime, TPUv5p might help to start on RLVR).