Musings on Reported Cost of Compute (Oct 2025)
There are many ways in which costs of compute get reported. A 1 GW datacenter site costs $10-15bn in infrastructure (buildings, cooling, power), plus $30-35bn in compute hardware (servers, networking, labor), assuming Nvidia GPUs. The useful life of the infrastructure is about 10-15 years, and with debt financing a developer only needs to ensure it's paid off over those 10-15 years, which comes out to $1-2bn per year. For the compute hardware, the useful life is taken as about 5 years, which gives $6-7bn per year. Operational expenses (electricity, maintenance) are about $2.0-2.5bn per year.
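For concreteness, a minimal back-of-envelope sketch of that annualization, taking midpoints of the ranges above (rough assumptions from this paragraph, not vendor figures):

```python
# Annualized cost of a 1 GW site, midpoints of the ranges quoted above.
infra_capex_bn = 12.5        # buildings, cooling, power ($10-15bn)
hardware_capex_bn = 32.5     # servers, networking, labor ($30-35bn)
opex_bn_per_year = 2.25      # electricity, maintenance ($2.0-2.5bn/yr)

infra_life_years = 12.5      # infrastructure lasts ~10-15 years
hardware_life_years = 5      # compute hardware lasts ~5 years

infra_per_year = infra_capex_bn / infra_life_years            # ~$1bn/yr
hardware_per_year = hardware_capex_bn / hardware_life_years   # ~$6.5bn/yr
total_per_year = infra_per_year + hardware_per_year + opex_bn_per_year

print(f"total: ~${total_per_year:.1f}bn per year")  # ~$9.8bn, within the $9-11bn range below
```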
In total, 1 GW of compute costs about $9-11bn per year. But whoever paid the compute hardware capex needs the payments to continue for 5 years, so a contract for 1 GW of compute will be 5 years long, which makes it a single contract for at least $45-55bn, and that might become $55-65bn to allow a profit for the cloud provider.
Thus 1 GW of compute could get reported as $10bn (infrastructure capex; or alternatively the compute costs for the AI company in a calendar year), as $30bn (compute hardware capex without labor costs), as $45bn (infrastructure plus compute hardware capex), or as $60bn (the total cost of the contract between the AI company and the cloud provider over 5 years).
A $300bn contract then might mean about 5 GW of total capacity, while a $27bn datacenter site could at the same time mean 2 GW of total capacity. And a $300bn contract doesn't mean that the 5 GW of capacity will be built immediately: if, for example, only 2 GW is built initially, that requires the AI company to be capable of paying about $25bn per year (for 5 years), with the other 3 GW being contingent on the AI company's continuing growth.
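A sketch of that mapping, using the same rough per-GW figures as above (the specific contract and site numbers are the ones quoted in this paragraph):

```python
# Mapping headline dollar figures back to capacity, midpoints from above.
infra_capex_bn_per_gw = 13.5      # $10-15bn of infrastructure per GW
contract_bn_per_gw_5yr = 60       # $55-65bn for a 5-year, 1 GW contract

print(300 / contract_bn_per_gw_5yr)    # $300bn contract       -> ~5 GW
print(27 / infra_capex_bn_per_gw)      # $27bn datacenter site -> ~2 GW
print(2 * contract_bn_per_gw_5yr / 5)  # 2 GW built first -> ~$24bn/yr owed by the AI company
```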
Non-Nvidia Hardware
The Nvidia servers (1 GW all-in is about 5500 GB200 NVL72 servers with 72 chips each, or 400K chips in total) take up about $20bn of capex, so if Nvidia's margin of about 70% applies to this part (it's probably less, since GPUs are not all of the server), it comes out to $14bn per GW, or $2.8bn per year in a 5-year contract between an AI company and a cloud provider, about 25% of the annual contract cost. This suggests that non-Nvidia compute might only cost up to 25% less, all else equal (which it isn't), even though GPUs are usually portrayed as the majority of the cost of compute, and Nvidia's margin as the majority of the cost of the GPUs.
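The margin share works out as follows (a sketch under the same assumptions; the per-year contract figure is the $55-65bn contract spread over 5 years):

```python
# Nvidia margin as a share of a 5-year, 1 GW compute contract (rough assumptions).
server_capex_bn = 20           # GB200 NVL72 servers in a 1 GW site
nvidia_margin = 0.70           # upper bound; GPUs are not all of the server
contract_bn_per_year = 12      # ~$55-65bn contract over 5 years

margin_bn = server_capex_bn * nvidia_margin       # ~$14bn per GW
margin_bn_per_year = margin_bn / 5                # ~$2.8bn per year
print(margin_bn_per_year / contract_bn_per_year)  # ~0.23, i.e. roughly 25%
```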
Thus a TPU contract for “tens of billions of dollars” and “over a gigawatt of capacity” admits an interpretation where it’s a ~5-year contract at ~$12bn per year for ~1.2 GW of compute (in total power, not just IT power).
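Checking that interpretation against the per-GW costs above (a sketch; the ~$12bn/yr and ~1.2 GW figures are the interpretation, not announced numbers):

```python
# Does "tens of billions" over ~5 years square with ~1.2 GW of TPU compute?
annual_payment_bn = 12
years = 5
capacity_gw = 1.2

total_contract_bn = annual_payment_bn * years    # $60bn, i.e. "tens of billions"
per_gw_5yr_bn = total_contract_bn / capacity_gw  # ~$50bn per GW over 5 years
print(per_gw_5yr_bn)  # vs. $55-65bn for the Nvidia case, i.e. roughly 10-25% cheaper
```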
Model Sizes in 2026
If the new contract is for TPUv7 Ironwood, in 2026 Anthropic will have 1 GW of compute with 49 TB of HBM per 256-chip pod. This is comparable to OpenAI’s Abilene site, which is 1 GW of compute with 14-20 TB of HBM per GB200/GB300 NVL72 rack, and will also be ready at this capacity in 2026. Currently Anthropic has access to Trainium 2 Ultra servers of AWS’s Project Rainier with 6 TB of HBM per rack, while OpenAI’s capacity is probably mostly in 8-chip Nvidia servers with 0.64-1.44 TB of HBM per server.
Feasible model size scales with HBM per server/rack/pod that serves as a scale-up world (especially for reasoning models and their training), so 2026 brings not just 10x more compute than 2024 (400K chips in GB200/GB300 NVL72 servers instead of 100K H100s), but also 10x larger models (in total parameter count). As the active params for compute optimal models scale with the square root of compute, this enables more sparsity in MoE models than 2024 compute did, and the number of active params for the larger models will no longer be constrained in practice by the 8-chip Nvidia servers (with 100K H100s, compute optimality asked for about 1T active params, which is almost too much for servers with 1 TB of HBM, and MoE models ask for multiple times more total params).
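A sketch of the square-root scaling, anchored on the ~1T-active-params-at-100K-H100s point above (the ~5e26 FLOPs figure for that cluster appears later in the post; everything here is approximate):

```python
import math

# Compute-optimal active params scale with the square root of training compute.
# Anchor: ~1T active params is compute optimal at ~5e26 FLOPs (100K H100s).
anchor_flops = 5e26
anchor_active = 1e12

def optimal_active_params(flops):
    return anchor_active * math.sqrt(flops / anchor_flops)

# ~10x the compute from a 1 GW GB200/GB300 NVL72 site in 2026:
print(f"{optimal_active_params(10 * anchor_flops):.1e}")  # ~3.2e12, i.e. ~3T active params
```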
Do we have reason to assume that MoE models will stretch to the entire rack memory capacity minus KV cache in batched inference? I ruminated earlier that Rubin Ultra could theoretically run 300T sparse models, but recently it looks like it's possible to squeeze much more from far fewer parameters and iterate much faster.
This data about compute spend at OpenAI suggests this trend: https://epoch.ai/data-insights/openai-compute-spend
At the same time, some Chinese open source models are downscaling in parameters. The most recent MiniMax M2 is much smaller than MiniMax M1. Qwen-Next 80B-A3B is close in performance to Qwen-3 235B-A22B. I wouldn’t be surprised if the same happened with DeepSeek-V4.
It’s not that models won’t grow, but it feels like intelligence density per parameter and non-degrading sparsity thresholds will be tested in the search for maximum training and inference efficiency.
Experiments on smaller models (and their important uses in production) will continue; the reasons they should continue don't affect the feasibility of there also being larger models. But currently there are reasons that larger models strain the feasibility of inference and RLVR, so they aren't as good as they could be, and cost too much to use. Also, a lot of use seems to be in input tokens (Sonnet 4.5 via OpenRouter processes 98% of tokens as input tokens), so the unit economics of input tokens remains important, and that's driven by the number of active params, a reason to still try to keep them down even when they are far from being directly constrained by inference hardware or training compute.
Prefill (input tokens) and pretraining are mostly affected by the number of active params, adding more total params on top doesn’t make it worse (but improves model quality). For generation (decoding, output tokens) and RLVR, what matters is the time to pass total params and KV cache through compute dies (HBM in use divided by HBM bandwidth), as well as latency for passing through the model to get to the next token (which doesn’t matter for prefill). So you don’t want too many scale-up worlds to be involved, or else it would take too much additional time to move between them, and you don’t care too much if the total amount of data (total params plus KV cache) doesn’t change significantly. So if you are already using 1-2 scale-up worlds (8-chip servers for older Nvidia chips, 72-chip racks for GB200/GB300 NVL72, not-too-large pods for TPUs), and ~half of their HBM is KV cache, you don’t lose too much from filling more of the other half with total params.
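As a rough model of the decode-side constraint described here (a sketch; the bandwidth and size numbers are illustrative assumptions, not measured specs):

```python
# Per-step decode time is roughly (bytes of total params + KV cache in use)
# divided by the scale-up world's aggregate HBM bandwidth.
def decode_seconds_per_step(total_params, bytes_per_param, kv_cache_bytes,
                            hbm_bandwidth_bytes_per_s):
    bytes_streamed = total_params * bytes_per_param + kv_cache_bytes
    return bytes_streamed / hbm_bandwidth_bytes_per_s

# Illustrative: a 10T-total-param model in ~1-byte weights plus ~4 TB of batched
# KV cache, on a rack with ~500 TB/s of aggregate HBM bandwidth (assumed figure).
t = decode_seconds_per_step(10e12, 1, 4e12, 500e12)
print(f"~{t*1e3:.0f} ms per generated token (for the whole batch)")  # ~28 ms
```

Under this model, when KV cache already takes about half the HBM, filling much of the other half with total params at most roughly doubles the per-step time, which is the sense in which you don't lose too much.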
It’s not a quantitative estimate, as the number of scale-up worlds and the fractions used up by KV cache and total params could vary, but when HBM per scale-up world goes up 10x, this suggests that total param counts might also go up 10x, all else equal. And the reason they are likely to actually go there is that even at 5e26 FLOPs (100K H100s, 150 MW), compute optimal number of active params is already about 1T. So if the modern models (other than GPT-4.5, Opus 4, and possibly Gemini 2.5 Pro) have less than 1T total params, they are being constrained by hardware in a way that’s not about the number of training FLOPs. If this constraint is lifted, the larger models are likely to make use of that.
For the Chinese models, the total-to-active ratio (sparsity) is already very high, but they don’t have enough compute to make good use of too many active params in pretraining. So we are observing this phenomenon of the number of total params filling the available HBM, despite the number of active params remaining low. With 1 GW datacenter sites, about 3T active params become compute optimal, so at least 1T active params will probably be in use for the larger models. Which asks for up to ~30T total params, so hardware will likely still be constraining them, but it won’t be constraining the 1T-3T active params themselves anymore.
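One way to see the hardware ceiling implied here (a sketch; the half-for-KV-cache split and 1 byte per weight are assumptions):

```python
# Ceiling on total params from a TPUv7 pod's HBM, per the 49 TB figure above.
pod_hbm_bytes = 49e12
kv_cache_fraction = 0.5   # assume ~half of HBM goes to batched KV cache
bytes_per_param = 1       # FP8/INT8-ish weights

max_total_params = pod_hbm_bytes * (1 - kv_cache_fraction) / bytes_per_param
print(f"~{max_total_params/1e12:.0f}T total params per pod")  # ~24-25T, short of the ~30T asked for
```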
Thanks for the detailed response. If we assume that continued progress requires using much more compute on RL than pretraining, this could favor model sizes much smaller than Chinchilla optimal. It’s possible that future models will be trained with 99% RL and will employ some sort of continuous learning architecture.
I believe the number of active parameters will be determined by achieving similar load on compute and memory, which will depend on the current level of attention efficiency and the context necessary to hit agent ability targets. I don't know how sparse very large (>20T) models could be, but they could probably be extremely sparse (maybe 0.1% or less, where most of the volume is taken by a router).
I think it’s possible that current and planned AI compute designs are not even close to optimal for training methods that may be developed within a year or so. Assuming something like DeepSeek DSA is relatively close to maximum efficiency, and RL pathways produced by agents will stretch to many millions of tokens, it may be the case that HBM will finally be too slow, and using SRAM will be the only way forward.
There’s also a completely different consideration. For example, inclusionAI/Ring-1T has 1T total and 50B active parameters, and the creators claim it can achieve silver in IMO using a multi-agent framework, with plans to train it further to achieve gold. And it’s only 50B active.
As you said, we’re probably entering compute ranges that enable training models with many T active parameters (assuming RL will be in roughly one-to-one proportion to pretraining). The question is whether pattern density in our environment is currently large enough to make use of it.
Compute optimality makes sense for RL as much as for pretraining. A model that’s too large won’t see much data, and a model that’s too small won’t be able to learn well even from the correspondingly larger amount of data. So it’s a quantitative question of where compute optimality for RLVR happens to be (compared to pretraining). The papers are still hashing out stable/scalable RL training, rather than the tradeoffs of RL training constrained by fixed total compute (under varying model size specifically).
I don’t understand how energy is still an appropriate unit for measuring compute capacity when there are two different chip paradigms. Do Nvidia cards and Ironwood TPUs give the exact same performance for the same energy input? What exactly are the differences in capacity to train/deploy models between the 1 GW capacity Anthropic will have and the 1 GW OpenAI will have? I looked into this a bit and it seems like TPUs are explicitly designed for inference only, is that accurate? I feel like compiling this kind of information somewhere would be a good idea, since it’s all rather opaque, technical, and obfuscated by press releases that seek to push a “look at our awesome 11 figure chip deal” narrative rather than provide actual transparency about capacity.
The Anthropic announcement says “up to one million TPUs”, and the Ironwood announcement claims 4.6e15 FP8 FLOP/s per chip. A 2-die GB200 chip produces 5e15 dense FP8 FLOP/s, and there are about 400K chips in the 1 GW phase of the Abilene system.
Thus if the Anthropic contract is for TPUv7 Ironwood, their 1 GW system will have about 2x the FLOP/s of the Abilene 1 GW system (probably because Ironwood is 3nm, while Blackwell is 4nm, which is a minor refinement of 5nm). Though it’s not clear that the Anthropic contract is for one system in the sense that Abilene is, that is, datacenters with sufficient bandwidth between them. But Google had a lot of time to set up inter-datacenter networking, so this is plausible even for collections of somewhat distant datacenter buildings. If this isn’t the case, then it’s only good for RLVR and inference, not for the largest pretraining runs.
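The 2x figure comes from multiplying these out (chip counts and per-chip rates as quoted above):

```python
# Peak FP8 FLOP/s of the two 1 GW systems, using the per-chip figures quoted above.
ironwood = 1_000_000 * 4.6e15   # "up to one million TPUs" at 4.6e15 FP8 FLOP/s each
abilene = 400_000 * 5e15        # ~400K GB200 chips at 5e15 dense FP8 FLOP/s each

print(ironwood / abilene)  # ~2.3, i.e. roughly 2x
```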
The reason things like this could happen is that OpenAI needed to give the go-ahead for the Abilene system in 2024, when securing a 1 GW Ironwood system from Google plausibly wasn’t in the cards, and in any case they wouldn’t want to depend on Google too much, because GDM is a competitor (and the Microsoft relationship was already souring). On the other hand, Anthropic still has enough AWS backing to make some dependence on Google less crucial, and they only needed to learn recently about the feasibility of a 1 GW system from Google. Perhaps OpenAI will be getting a 1-2 GW system from Google as well at some point, but then Nvidia Rubin (not to mention Rubin Ultra) is not necessarily worse than Google’s next thing.
I think it’s a fair assumption that they are close enough. If they weren’t, why on Earth would someone still be using whichever happened to be the vastly more inefficient option?
Because it’s what they can get. A factor of two or more in compute is plausibly less important than a delay of a year.
This may or may not be the case, but the argument for why it can’t be very different fails.
Well, within reason that can happen; I am not saying the metric is going to be perfect. But it’s probably a decent first-order approximation, because that logic can’t stretch forever. If instead of a factor of 2 it was a factor of 10, the trade-off would probably not be worth it.
Data. Find out the answer.
https://www.wevolver.com/article/tpu-vs-gpu-a-comprehensive-technical-comparison
Looks like they are within 2x of the H200s, albeit with some complexity in the details.
Thanks! I guess my original statement came off a bit too strong, but what I meant is that while there is a frontier for trade-offs (maybe the GPUs’ greater flexibility is worth the 2x energy cost?), I didn’t expect the gap to be orders of magnitude. That’s good enough for me, with the understanding that any such estimates will never be particularly accurate anyway and just give us a rough idea of how much compute these companies are actually fielding. What they squeeze out of that will depend on a bunch of other details anyway, so scale is the best we can guess.