Prefill (processing of input tokens) is efficient: something like 60% compute utilization might be possible, and the cost only depends on the number of active params. Generation of output tokens is HBM bandwidth bound and depends on the number of total params and on the number of KV cache sequences for requests in a batch that fit on the same system (which share the cost of chip-time[1]). With GB200 NVL72, batches could be huge, dividing the cost of output tokens across many requests (though output tokens will still probably be several times more expensive per token than prefill).
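To make the batch-sharing point concrete, here is a rough roofline sketch of bandwidth-bound decoding on a single GB200 NVL72 rack. The ~8 TB/s of HBM bandwidth per chip and the ~2 GB of KV cache per sequence are my illustrative assumptions, not figures from the estimates below, and parallelism overheads are ignored.

```python
# Rough roofline sketch: each decode step streams all weights plus every
# sequence's KV cache through HBM, and the whole batch shares that step's cost.
RACK_CHIPS = 72
HBM_BW_PER_CHIP = 8e12                        # bytes/s per GB200 chip (assumed)
RACK_BW = RACK_CHIPS * HBM_BW_PER_CHIP

PARAM_BYTES = 8e12                            # 8T total params at FP8 (1 byte/param)
KV_BYTES_PER_SEQ = 2e9                        # ~2 GB of KV cache per sequence (assumed)
RACK_PRICE_PER_S = RACK_CHIPS * 3.2 / 3600    # $/s at $3.2 per chip-hour

def cost_per_output_token(batch_size: int) -> float:
    step_seconds = (PARAM_BYTES + batch_size * KV_BYTES_PER_SEQ) / RACK_BW
    # Each step produces one token per sequence, so the batch shares the cost.
    return RACK_PRICE_PER_S * step_seconds / batch_size

for bs in (8, 64, 512):
    print(f"batch {bs:4d}: ${cost_per_output_token(bs) * 1e6:.2f} per 1M output tokens")
```

With these assumed numbers, large batches bring the cost down to a few dollars per 1M output tokens, while small batches are an order of magnitude or two more expensive, which is the sense in which huge batches divide the cost.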
For prefill, we can directly estimate at-cost inference from the capital cost of compute hardware, assuming a need to pay it back in 3 years (it will likely serve longer but become increasingly obsolete). An H100 system costs about $50K per chip ($5bn for a 100K H100 system). This is all-in for compute equipment, so it includes networking but not buildings and cooling, since those serve longer and don’t need to be paid back in 3 years. Operational costs are maybe below 20% of that, which gives about $20K per year per chip ($50K / 3 years × 1.2), or $2.3 per H100-hour. On gpulist, there are many listings at $1.80 per H100-hour, so my methodology might be somewhat overestimating the bare-bones cost.
For GB200 NVL72, which are still too scarce to get a visible market price anywhere close to at-cost, the all-in cost together with external networking in a large system is plausibly around $5M per 72-chip rack ($7bn for a 100K chip GB200 NVL72 system, $30bn for Stargate Abilene’s 400K chips in GB200/GB300 NVL72 racks). This is about $70K of capital cost per chip, or $27.7K per year to pay it back in 3 years with 20% operational costs, which is just $3.2 per chip-hour.
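A minimal sketch of the payback arithmetic from the last two paragraphs, with the 3-year payback and ~20% operational overhead as stated above (the function and variable names are mine):

```python
HOURS_PER_YEAR = 365.25 * 24

def at_cost_per_chip_hour(capital_per_chip: float,
                          payback_years: float = 3.0,
                          opex_fraction: float = 0.2) -> float:
    # Annualized capital cost plus operational overhead, spread over every hour.
    yearly_cost = capital_per_chip / payback_years * (1 + opex_fraction)
    return yearly_cost / HOURS_PER_YEAR

print(at_cost_per_chip_hour(50_000))          # H100 at $50K/chip: ~$2.3/hour
print(at_cost_per_chip_hour(5_000_000 / 72))  # GB200 NVL72 at $5M/rack: ~$3.2/hour
```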
A 1T active param model consumes about 2e18 FLOPs per 1M tokens (2 FLOPs per active param per token). GB200 chips can do 5e15 FP8 FLOP/s or 10e15 FP4 FLOP/s. At $3.2 per chip-hour and 60% utilization (for prefill), this translates to $0.6 per million input tokens at FP8, or $0.3 per million input tokens at FP4. The API price for the batch mode of GPT-5 is $0.62 input, $5 output, so it might even be possible with FP8. And the 8T total params wouldn’t matter with GB200 NVL72: they fit with space to spare in just one rack/domain.
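The same arithmetic for the per-token prefill cost, using the numbers above; the function is just my framing of it:

```python
def prefill_cost_per_million_tokens(active_params: float,
                                    chip_flops_per_s: float,
                                    chip_hour_price: float,
                                    utilization: float = 0.6) -> float:
    # ~2 FLOPs per active param per token, at the assumed compute utilization.
    flops = 2 * active_params * 1e6
    seconds = flops / (chip_flops_per_s * utilization)
    return chip_hour_price / 3600 * seconds

print(prefill_cost_per_million_tokens(1e12, 5e15, 3.2))   # FP8: ~$0.6 per 1M input tokens
print(prefill_cost_per_million_tokens(1e12, 10e15, 3.2))  # FP4: ~$0.3 per 1M input tokens
```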
This is an at-cost estimate, in contrast to cloud provider prices. Oracle is currently selling 4-chip GB200 instances at $16 per chip-hour, but it’s barely on the market for now, so the prices don’t yet reflect costs. And GCP, for example, is still selling an H100-hour for $8 (a3-megagpu-8g instances). So for the major clouds, the price of GB200 might end up only coming down to $11 per chip-hour in 2026-2027, even though the bare-bones at-cost price is only $3.2 per chip-hour (or a bit lower).
I’m counting chips rather than GPUs to future-proof my terminology, since Huang recently proclaimed that starting with Rubin, compute dies will be considered GPUs (at March 2025 GTC, 1:28:04 into the keynote), so that a single chip will have 2 GPUs, and with Rubin Ultra a single chip will have 4 GPUs. It doesn’t help that Blackwell already has 2 compute dies per chip. This is sure to lead to confusion when counting things in GPUs, but counting in chips will remain less ambiguous.