For example, it seems too fast to be larger than GPT-4.5.
A “GPT-5” named according to the previous convention in terms of pretraining compute would need at least 1e27 FLOPs (50x the original GPT-4), which on H100/H200 can at best be done in FP8. That could be done with 150K H100s running for 3 months at 40% utilization. (GB200 NVL72 is too recent to use for this pretraining run, though there is a remote possibility of B200.) A compute-optimal shape for this model would be something like 8T total params, 1T active[1].
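As a rough sanity check on this arithmetic (assuming ~2e15 dense FP8 FLOP/s per H100 and the standard C ≈ 6·N·D approximation; none of these are disclosed figures):

```python
# Rough check of the pretraining-compute claim (all numbers are assumptions).
h100_fp8_flops = 2e15          # dense FP8 throughput of one H100, FLOP/s (approx.)
n_chips = 150_000
utilization = 0.40
seconds = 3 * 30 * 86400       # ~3 months

total_flops = h100_fp8_flops * n_chips * utilization * seconds
print(f"pretraining compute: {total_flops:.1e} FLOPs")   # ~9e26, close to 1e27

# Compute-optimal shape under C = 6 * N_active * D with ~120 tokens per active param
# (120 = 40 tokens/param dense anchor, times 3 for 1:8 sparsity, per the footnote).
C = 1e27
tokens_per_param = 120
n_active = (C / (6 * tokens_per_param)) ** 0.5
print(f"active params: {n_active:.1e}")                      # ~1.2e12, i.e. ~1T active
print(f"total params at 1:8 sparsity: {8 * n_active:.1e}")   # ~9e12, the ~8T ballpark
```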
The speed of GPT-5 could be explained by using GB200 NVL72 for inference, even if it’s an 8T total param model. GPT-4.5 was slow and expensive likely because it needed many older 8-chip servers (which have 0.64-1.44 TB of HBM) to keep the weights in HBM with room left over for KV caches, whereas a single GB200 NVL72 rack has about 14 TB of HBM. At the same time, GB200 NVL72 wouldn’t help as much with the speed of smaller models (though it would help with their output token cost, because more KV cache fits in the same NVLink domain; this isn’t necessarily reflected in prices yet, since GB200 NVL72 is still scarce). So it remains somewhat plausible that GPT-5 is essentially GPT-4.5-thinking running on better hardware.
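A quick HBM accounting sketch, assuming FP8 weights and approximate per-chip HBM capacities:

```python
# Rough HBM accounting for serving an 8T-total-param model (assumed sizes, FP8 weights).
total_params = 8e12
weight_bytes = total_params * 1              # 1 byte/param at FP8 -> 8 TB of weights

systems = {                                  # approximate HBM per server / per rack
    "8x H100 (80 GB each)":        8 * 80e9,    # 0.64 TB
    "8x B200 (180 GB each)":       8 * 180e9,   # 1.44 TB
    "GB200 NVL72 (72 x ~192 GB)":  72 * 192e9,  # ~13.8 TB
}
for name, hbm in systems.items():
    print(f"{name}: {hbm/1e12:.2f} TB HBM; "
          f"8 TB of weights = {weight_bytes/hbm:.1f}x its capacity")
```

On the older 8-chip servers the weights alone are many times the HBM of one server, while on a single NVL72 rack they take a bit over half of it, leaving several TB for KV caches.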
Its performance (quality, not speed), though, suggests that it might well be a smaller model with pretraining-scale RLVR (possibly like Grok 4, just done better). Also, doing RLVR on an 8T total param model using the older 8-chip servers would be slow and inefficient, and GB200 NVL72 might’ve only started appearing in large numbers in late Apr 2025. METR’s report on GPT-5 states they gained access “four weeks prior to its release”, which means it was already essentially done by the end of Jun 2025. So RLVRing a very large model on GB200 NVL72 is in principle possible in this timeframe, but probably not what happened, and, more to the point given its level of performance, probably not what needed to happen. This way, they get a better within-model gross margin and can work on the actual very large model in peace; maybe they’ll call it “GPT-6”.

This is assuming 120 tokens/param as compute optimal, 40 tokens/param from Llama 3 405B as the dense anchor, and 3x that for 1:8 sparsity.
The speed of GPT-5 could be explained by using GB200 NVL72 for inference, even if it’s an 8T total param model.
Ah, interesting! So the speed we see shouldn’t tell us much about GPT-5’s size.
I omitted one other factor from my shortform, namely cost. Do you think OpenAI would be willing to serve an 8T param (1T active) model at the price we’re seeing? I’m basically trying to understand whether GPT-5 being served relatively cheaply should be a large or small update.
Prefill (processing of input tokens) is efficient, something like 60% compute utilization might be possible, and its cost only depends on the number of active params. Generation of output tokens is HBM bandwidth bound and depends on the number of total params and on how many KV cache sequences for requests in a batch fit on the same system (since they share the cost of chip-time[1]). With GB200 NVL72, batches could be huge, dividing the cost of output tokens (which still probably ends up several times more expensive per token than prefill).
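As a toy illustration of why batch size matters so much for output tokens: with the weights sharded over a GB200 NVL72 rack, every decode step re-reads the full weights from HBM, and all sequences in the batch share that read. Assuming FP8 weights, ~8 TB/s of HBM bandwidth per chip, and the ~$3.2 per chip-hour at-cost figure estimated below (KV-cache reads are ignored, so real costs are somewhat higher):

```python
# Toy model of output-token cost on one GB200 NVL72 rack (assumed numbers).
total_params = 8e12
weight_bytes = total_params * 1            # FP8: 1 byte/param, sharded over the rack
chips = 72
bw_per_chip = 8e12                         # bytes/s of HBM bandwidth per chip (approx.)
chip_hour_cost = 3.2                       # at-cost estimate from below

step_time = (weight_bytes / chips) / bw_per_chip      # ~14 ms per decode step
rack_cost_per_s = chips * chip_hour_cost / 3600       # ~$0.064 of rack-time per second
cost_per_step = rack_cost_per_s * step_time           # whole batch advances one token

for batch in (32, 256, 1024):
    per_million = cost_per_step / batch * 1e6
    print(f"batch {batch}: ~${per_million:.2f} per 1M output tokens (weights only)")
```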
For prefill, we can directly estimate at-cost inference from the capital cost of compute hardware, assuming a need to pay it back in 3 years (it will likely serve longer, but become increasingly obsolete). An H100 system costs about $50K per chip ($5bn for a 100K-H100 system). This is all-in for compute equipment, so with networking but without buildings and cooling, since those serve longer and don’t need to be paid back in 3 years. Operational costs are maybe below 20%, which gives $20K per year per chip, or $2.3 per H100-hour. On gpulist, there are many listings at $1.80 per H100-hour, so my methodology might be somewhat overestimating the bare-bones cost.
For GB200 NVL72, which is still too scarce to have a visible market price anywhere close to at-cost, the all-in cost together with external networking in a large system is plausibly around $5M per 72-chip rack ($7bn for a 100K-chip GB200 NVL72 system, $30bn for Stargate Abilene’s 400K chips in GB200/GB300 NVL72 racks). That’s about $70K of capital cost per chip, or $27.7K per year to pay it back over 3 years with 20% operational costs on top. This is just $3.2 per chip-hour.
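A minimal sketch of this payback arithmetic (the per-chip capital figures are the assumptions above, not vendor quotes):

```python
# At-cost chip-hour: pay back all-in capital over 3 years, plus ~20% for operations.
def at_cost_per_hour(capital_per_chip, payback_years=3, ops_overhead=0.20):
    per_year = capital_per_chip / payback_years * (1 + ops_overhead)
    return per_year / (365.25 * 24)

print(f"H100  (~$50K/chip): ${at_cost_per_hour(50_000):.2f}/hr")   # ~$2.3
print(f"GB200 (~$70K/chip): ${at_cost_per_hour(70_000):.2f}/hr")   # ~$3.2
```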
A 1T active param model consumes 2e18 FLOPs per 1M input tokens. GB200 chips can do 5e15 FP8 FLOP/s or 10e15 FP4 FLOP/s. At $3.2 per chip-hour and 60% utilization (for prefill), this translates to $0.6 per million input tokens at FP8, or $0.3 per million input tokens at FP4. The API price for the batch mode of GPT-5 is $0.62 per million input tokens and $5 per million output tokens. So it might even be possible with FP8. And the 8T total params wouldn’t matter on GB200 NVL72: they fit with room to spare in a single rack/domain.
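For concreteness, here is that prefill arithmetic, assuming the at-cost $3.2 per chip-hour figure and 2 FLOPs per active param per token for the forward pass:

```python
# Prefill cost per 1M input tokens for a 1T-active-param model (assumed figures).
active_params = 1e12
flops_per_token = 2 * active_params        # forward pass: ~2 FLOPs per active param
chip_hour_cost = 3.2
utilization = 0.60

for label, peak in (("FP8", 5e15), ("FP4", 10e15)):
    chip_seconds = flops_per_token * 1e6 / (peak * utilization)
    cost = chip_seconds / 3600 * chip_hour_cost
    print(f"{label}: ~${cost:.2f} per 1M input tokens")   # ~$0.59 at FP8, ~$0.30 at FP4
```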
This is an at-cost estimate, in contrast to cloud provider prices. Oracle is currently selling 4-chip GB200 instances at $16 per chip-hour, but GB200 is barely on the market for now, so prices don’t yet reflect costs. And GCP, for example, is still selling an H100-hour for $8 (a3-megagpu-8g instances). So for the major clouds, the price of GB200 might only come down to about $11 per chip-hour in 2026-2027, even though the bare-bones at-cost price is only $3.2 per chip-hour (or a bit lower).
I’m counting chips rather than GPUs to future-proof my terminology, since Huang recently proclaimed that starting with Rubin, individual compute dies will be counted as GPUs (at the March 2025 GTC keynote, 1:28:04 in), so a single chip will have 2 GPUs, and with Rubin Ultra a single chip will have 4 GPUs. It doesn’t help that Blackwell already has 2 compute dies per chip. This is sure to lead to confusion when counting things in GPUs, but counting in chips will remain less ambiguous.
This may be an unlikely possibility, but could it be that the different versions of GPT-5 (i.e., the normal model, the thinking model, and the thinking-pro model) are actually of different sizes? Or do we know for sure that they all share the same architecture?