Flagship models need inference compute at gigawatt scale with a lot of HBM per scale-up world. Nvidia’s systems are currently a year behind for serving models with trillions of total params, and will remain behind until 2028-2029 for serving models with tens of trillions of total params. Thus if OpenAI fails to get access to TPUs or some other alternative to Nvidia (at gigawatt scale), it will remain unable to serve a flagship model with a competitive number of total params until late 2028 to 2029. There will be a window in 2026 when OpenAI catches up, but after that it falls behind again.
The current largest flagship models are Gemini 3 Pro and Opus 4.5, probably at multiple trillions of total params, requiring systems with multiple TB of HBM per scale-up world to serve efficiently. They are likely served on Trillium (TPUv6e, 8 TB per scale-up world) and Trainium 2 Ultra (6 TB per scale-up world) respectively, and serving their user bases needs high hundreds of megawatts of such systems, or more.
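To make the “multiple TB of HBM per scale-up world” point concrete, here’s a minimal botec in the same spirit, assuming roughly 1 byte per weight (FP8-ish) and a rough 1.5x headroom factor for KV cache; both numbers are assumptions for illustration, not reported figures:

```python
# Back-of-the-envelope: HBM needed to hold one replica of a model's weights
# within a single scale-up world (so decoding doesn't have to cross the
# slower between-worlds network for every layer).

BYTES_PER_PARAM = 1.0      # assumed ~FP8/INT8 weights; BF16 would double this
KV_CACHE_HEADROOM = 1.5    # assumed extra room for KV cache and activations

def hbm_needed_tb(total_params_trillions: float) -> float:
    """TB of HBM for one weight replica plus assumed KV-cache headroom."""
    return total_params_trillions * BYTES_PER_PARAM * KV_CACHE_HEADROOM

# HBM per scale-up world, using the figures from the comment above (TB)
scale_up_worlds_tb = {
    "Trillium (TPUv6e) pod": 8,
    "Trainium 2 Ultra": 6,
    "H200 8-GPU node": 1.1,
    "GB300 NVL72": 20,
}

for params_t in (2, 4):
    need = hbm_needed_tb(params_t)
    fits = [name for name, tb in scale_up_worlds_tb.items() if tb >= need]
    print(f"{params_t}T total params -> ~{need:.1f} TB needed; fits in: {fits}")
```

On these assumptions, an 8-GPU H200 node can’t even hold the weights of a multi-trillion-param model, while the TPU/Trainium pods and NVL72 can.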
Nvidia’s system in this class is GB200/GB300 NVL72 (14/20 TB per scale-up world), but so far there isn’t enough of it built, and so models served with Nvidia’s older hardware (H100/H200/B200, 0.6-1.4 TB per 8-chip scale-up world) either have to remain smaller or become more expensive. The smaller number of NVL72s currently in operation can only serve large models to a smaller user base. As a result, OpenAI will probably have to keep the smaller GPT-5 as the flagship model until they and Azure build enough NVL72s, which will happen sometime in mid to late 2026 (the bigger model will very likely get released much earlier than that, perhaps even imminently, but will have to remain heavily restricted by either price or rate limits). Paradoxically, xAI might be in a better position as a result of having fewer users, and so they might be able to serve their 6T total param Grok 5 starting early 2026 at a reasonable price.
But then in 2026, there is a gigawatt scale buildout of Ironwood (TPUv7, 50 TB per scale-up world for the smallest 256-chip pods, more with slices of 9216-chip pods), available to both GDM and Anthropic. This suggests the possibility of flagship models (Gemini 4, Opus 5) with tens of trillions of total params at the end of 2026 (maybe the start of 2027), 10x larger than what we have today, something even Nvidia’s NVL72 systems (whether Blackwell or Rubin) won’t be well suited to serve. Nvidia’s answer to Ironwood is Rubin Ultra NVL576 (150 TB of HBM per scale-up world), but it’s only out in 2027, which means there won’t be enough of it built until at least late 2028, plausibly 2029 (compare to GB200 NVL72 being out in 2024, with gigawatt scale systems only built in 2026). So if OpenAI/xAI/Meta want to serve a flagship model with tens of trillions of total params in 2026-2028, they need access to enough TPUs, or some other alternative systems (which seems less likely given the short notice).
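Extending the same sketch (same assumed 1 byte/param and KV-cache headroom) to the tens-of-trillions regime shows why NVL72-class worlds would need to shard one weight replica across several scale-up worlds, while an Ironwood pod or NVL576 can hold it in one:

```python
import math

BYTES_PER_PARAM = 1.0      # assumed ~FP8 weights
KV_CACHE_HEADROOM = 1.5    # assumed extra room for KV cache

worlds_tb = {
    "GB300 NVL72": 20,
    "Ironwood 256-chip pod": 50,
    "Rubin Ultra NVL576": 150,
}

for params_t in (20, 30):
    need_tb = params_t * BYTES_PER_PARAM * KV_CACHE_HEADROOM
    for name, tb in worlds_tb.items():
        n_worlds = math.ceil(need_tb / tb)
        print(f"{params_t}T params (~{need_tb:.0f} TB): {name} needs "
              f"{n_worlds} scale-up world(s) per weight replica")
```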
If compute used for RL is comparable to compute used for inference for GDM and Anthropic, then serving to users might not be that important of a dimension. I guess it could be acceptable to have much slower inference for RL but not for serving to users.
Once a model is trained, it needs to be served. If both xAI and OpenAI have their ~6T total param models already trained in Jan 2026, xAI will have enough NVL72 systems to serve the model to all its users at a reasonable speed and price (and so they will), while OpenAI won’t have that option at all (without restricting demand somehow, probably with rate limits or higher prices).
A meaningful fraction of the buildout of a given system is usually online several months to a year before the bulk of it; that’s when the first news about cloud access appears. If new inference hardware makes more total params practical than previous hardware, and scaling of hardware amount (between hardware generations) is still underway, then even a fraction of the new hardware buildout will be comparable to the bulk of the old hardware (which is busy anyway) in FLOPs and in the training steps that RL can get out of it, adjusting for better efficiency. And slow rollouts for RL (on old hardware) increase batch sizes and decrease the total number of training steps that fit into a few months, which could also be important.
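As a toy illustration of both points (every number here, the 4x generation-to-generation FLOPs growth, the 20% early fraction, the efficiency gain, and the rollout latencies, is an assumption for the sake of the sketch, not an estimate):

```python
# Toy comparison: RL compute from an early fraction of a new hardware
# generation vs. the full buildout of the previous generation, plus the
# effect of rollout latency on sequential training steps. All numbers
# below are illustrative assumptions.

OLD_GEN_TOTAL = 1.0        # normalize the old generation's full buildout to 1
GEN_SCALE = 4.0            # assumed FLOPs growth between hardware generations
EARLY_FRACTION = 0.2       # assumed share of the new buildout online early
EFFICIENCY_GAIN = 1.3      # assumed better efficiency on the new hardware

early_new = OLD_GEN_TOTAL * GEN_SCALE * EARLY_FRACTION * EFFICIENCY_GAIN
print(f"early slice of the new generation ~ {early_new:.1f}x the full old-gen buildout")

# Rollout speed: in a fixed wall-clock window, slower rollouts mean fewer
# sequential optimizer steps (or larger batches to keep the hardware busy).
WALL_CLOCK_DAYS = 90
for rollout_minutes in (30, 10):   # assumed rollout latency: old vs. new hardware
    steps = WALL_CLOCK_DAYS * 24 * 60 / rollout_minutes
    print(f"{rollout_minutes}-min rollouts -> ~{steps:,.0f} sequential steps in {WALL_CLOCK_DAYS} days")
```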
The new hardware that becomes available before the bulk of it is not uniquely useful for serving to users, because there isn’t enough of it to serve a flagship model with more total params to all users. So it does seem to make sense to use it for RL training of a new model that will then be served to users on the bulk of the same hardware once enough of it is built.
Any source for Gemini 3 Pro and Opus 4.5 being multiple trillions of params? Just intuitively, from the serving speed of Opus 4.5, this seems dubious.
(not a reliable source, just a fun botec)
I don’t understand why HBM per scale-up world is a major constraint for inference. For Deepseek V3, “The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs.” (section 3.4.2 https://arxiv.org/pdf/2412.19437v1). This seems like evidence that you can get reasonably efficient inference out of multi-node setups.
If using H800s, this means that their minimum deployment unit has 25.6 TB of memory. In general it seems like there are probably a lot of engineering tricks you can do to get efficient inference out of multi-node setups.
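A quick sanity check of that figure, assuming (per the paper’s expert-parallel decoding setup) that the weights are sharded across the unit rather than replicated per node:

```python
# Sanity check on the DeepSeek-V3 decoding deployment unit quoted above.
GPUS = 320
HBM_PER_H800_GB = 80           # H800 carries 80 GB of HBM
TOTAL_PARAMS_B = 671           # DeepSeek-V3 total parameters, in billions
BYTES_PER_PARAM = 1            # the model is served in (roughly) FP8

unit_hbm_tb = GPUS * HBM_PER_H800_GB / 1000
weights_tb = TOTAL_PARAMS_B * BYTES_PER_PARAM / 1000
weights_per_gpu_gb = TOTAL_PARAMS_B * BYTES_PER_PARAM / GPUS

print(f"deployment unit HBM: {unit_hbm_tb:.1f} TB")              # ~25.6 TB
print(f"one copy of the weights: ~{weights_tb:.2f} TB")          # ~0.67 TB
# If the weights are sharded across the unit (expert parallelism) rather than
# replicated per node, each GPU holds only a small slice of them, and most of
# the 25.6 TB is left for KV cache and large decoding batches.
print(f"weights per GPU if fully sharded: ~{weights_per_gpu_gb:.1f} GB of {HBM_PER_H800_GB} GB")
```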
Do you know the reason for the NVL72 delay? I thought it was announced in March 2024.
I don’t think there is a delay specific to NVL72; it just takes this long normally, and with all the external customers, Nvidia needs to announce things a bit earlier than, say, Google. This is why I expect Rubin Ultra NVL576 (the next check on TPU dominance after 2026’s NVL72) to also take similarly long. It’s announced for 2027, but 2028 will probably only see completion of a fraction of the eventual buildout, and only in 2029 will the bulk of the buildout be completed (though maybe late 2028 will be made possible for NVL576 specifically, given the urgency and time to prepare). This would enable companies like OpenAI (without access to TPUs at gigawatt scale) to serve flagship models at the next level of scale (what 2026 pretraining compute asks for) to all their users, catching up to where Google and Anthropic were in 2026-2027 thanks to Ironwood. Unless Google decides to give yet another of its competitors this crucial resource and allows OpenAI to build gigawatts of TPUs earlier than 2028-2029.
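Stated as a toy lag model (the carried-over ~2-year lag from GB200 NVL72 is the whole assumption):

```python
# Toy lag model: carry over the GB200 NVL72 gap between "out" and
# "bulk of the buildout completed", and apply it to Rubin Ultra NVL576.
GB200_OUT, GB200_BULK = 2024, 2026
LAG_YEARS = GB200_BULK - GB200_OUT          # ~2 years

NVL576_OUT = 2027
nvl576_bulk = NVL576_OUT + LAG_YEARS
print(f"Rubin Ultra NVL576: out {NVL576_OUT} -> bulk of buildout ~{nvl576_bulk}, "
      f"possibly late {nvl576_bulk - 1} if the ramp is compressed")
```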
Do you know why it takes such a long time to deploy a new rack system at scale? In my mind you slap on the new Rubin chips, more HBM, and you are good to go. (In your linked comment you mention “reliability issues”; is that where the bulk of the time comes from? I did not read the linked SemiAnalysis article.) Or does everything, including e.g. cooling and interconnects, have to be redesigned from scratch for each new rack system, so you can’t reuse any of the older proven/reliable components?
The fact that things other than the chips need to be redesigned wouldn’t argue either way, because in that hypothetical everything could simply come together at once, the other components on the same schedule as the chips themselves. The issue is the capacity of factories and labor for all the stuff, the integration, and the construction. You can’t produce everything all at once; instead, you need to produce each kind of thing that goes into the finished datacenters over the course of at least months, maybe as long as 2 years for sufficiently similar variants of a system that can share many steps of the process (as with H100/H200/B200 previously, and now GB200/GB300 NVL72).
How elaborate the production process needs to be also doesn’t matter; it just shifts the arrival of the finished systems in time (even if substantially), with the first systems still getting ready earlier than the bulk of them. And so the first 20% of everything (at a given stage of production) will be ready partway into the volume production period (in a broad sense that also includes construction of datacenter buildings or burn-in of racks), significantly earlier than most of it.
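A minimal way to picture that: assume (purely for illustration) a 24-month volume production period with monthly output ramping linearly; the first 20% of units then lands well before the median unit:

```python
# Toy production ramp: monthly output grows linearly over a fixed volume
# production period. When is the first 20% of cumulative output ready,
# compared to the median (50th-percentile) unit?
MONTHS = 24                                   # assumed volume production period
monthly = [m + 1 for m in range(MONTHS)]      # linear ramp in units/month
total = sum(monthly)

def month_when_fraction_done(frac: float) -> int:
    done = 0
    for month, units in enumerate(monthly, start=1):
        done += units
        if done >= frac * total:
            return month
    return MONTHS

print(f"first 20% of units ready by month {month_when_fraction_done(0.2)} of {MONTHS}")
print(f"half of all units ready by month {month_when_fraction_done(0.5)} of {MONTHS}")
```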