OpenAI is going to remove GPT-4.5 from the API on July 14, 2025.
At the time I read that as an early announcement of when they are releasing GPT-4.5-thinking (with a controllable thinking budget), possibly to be called “GPT-5”, so that the non-thinking GPT-4.5 becomes obsolete. The first GB200 NVL72s might be coming online about now, which should allow both fast serving and more reasonable pricing for very large models, even with reasoning.
I don’t get it. Have these people not heard of prices?
The issue with very large models is that you need some minimal number of GPUs just to keep them in memory, and you can’t serve them at all with fewer GPUs than that. If almost nobody uses the model, you are still paying for all the time of those GPUs. If GPT-4.5 is a 1:8 sparse MoE model pretrained in FP8 (the announcement video mentioned training in low precision) on 100K H100s (the Azure Goodyear campus), it could be about 5e26 FLOPs of training compute. At 1:8 sparsity, the compute-optimal tokens per active param ratio is 3x the dense ratio, and the dense ratio is about 40 tokens per param, so about 120 tokens per active param. Plugging that into the C ≈ 6ND approximation for training compute gives 830B active params, 6.7T total params, and 100T training tokens.
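For concreteness, here is that arithmetic as a minimal Python sketch. The C ≈ 6ND approximation and all the inputs are the estimates above, not anything OpenAI has confirmed:

```python
# Reproduce the sizing arithmetic from the paragraph above.
# All inputs are my assumptions: 5e26 training FLOPs, C ~ 6*N*D,
# 1:8 sparsity, ~120 training tokens per active parameter.

C = 5e26                    # assumed training compute, FLOPs
tokens_per_active = 3 * 40  # 3x the ~40 tokens/param dense-optimal ratio
sparsity = 8                # 1:8 active-to-total params

# C = 6 * N * D with D = tokens_per_active * N  =>  N = sqrt(C / (6 * tokens_per_active))
N_active = (C / (6 * tokens_per_active)) ** 0.5
N_total = sparsity * N_active
D_tokens = tokens_per_active * N_active

print(f"active params: {N_active:.2e}")  # ~8.3e11 (830B)
print(f"total params:  {N_total:.2e}")   # ~6.7e12 (6.7T)
print(f"tokens:        {D_tokens:.2e}")  # ~1.0e14 (100T)
```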
The reason I chose 1:8 sparsity in the estimate is that a GB200 NVL72 rack has about 13 TB of HBM, so 6.7T FP8 total params comfortably fit, leaving space for KV-cache. A GB200 NVL72 rack costs about $3M, as Huang recently announced. Alternatively, you might need 12 nodes of H200 (141 GB of HBM per chip, 96 chips in total), which is 3 racks; this will work more slowly and so will serve fewer requests for the same GPU-time, and these racks might cost about $6M.
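As a sanity check on the memory fit, a short sketch, assuming roughly 186 GB of HBM per GB200 GPU (72 per rack), 141 GB per H200, and 1 byte per parameter in FP8:

```python
# Does a 6.7T-param FP8 model fit in one GB200 NVL72 rack, or 12 H200 nodes?
# Per-GPU HBM figures below are my assumptions, not vendor-confirmed totals.

params = 6.7e12
weight_bytes = params * 1   # FP8: 1 byte per param -> ~6.7 TB of weights

nvl72_hbm = 72 * 186e9      # ~13.4 TB per GB200 NVL72 rack
h200_hbm = 96 * 141e9       # ~13.5 TB across 12 8-GPU H200 nodes

for name, hbm in [("GB200 NVL72", nvl72_hbm), ("12x H200 nodes", h200_hbm)]:
    print(f"{name}: {hbm / 1e12:.1f} TB HBM, "
          f"{(hbm - weight_bytes) / 1e12:.1f} TB left for KV-cache")
```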
So that’s an anchor for the fixed costs: to get pricing down to marginal costs, there need to be enough active users to keep several multiples of that hardware busy most of the time. If there aren’t enough users, you still need to pay for the dedicated time of at least those 3 racks of H200 or 1 rack of GB200 NVL72, and that’s not something you can reasonably price.
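To make the fixed-vs-marginal distinction concrete, here is an illustrative sketch; the amortization schedule and especially the rack-level throughput number are made-up placeholders, not estimates of GPT-4.5’s actual serving economics:

```python
# Why low utilization breaks API pricing: the hardware costs the same
# per hour whether or not anyone sends requests.
# Assumptions (illustrative only): $6M of H200 racks amortized over
# 4 years, ignoring power, hosting, and margins; the throughput figure
# is a hypothetical placeholder, not a measured number.

capex = 6e6
hours = 4 * 365 * 24
hourly_cost = capex / hours          # ~ $171/hr, used or idle

peak_tokens_per_s = 50_000           # hypothetical aggregate throughput
for utilization in (1.0, 0.1, 0.01):
    tokens_per_hour = peak_tokens_per_s * 3600 * utilization
    print(f"{utilization:>5.0%} utilized: "
          f"${hourly_cost / tokens_per_hour * 1e6:.2f} per 1M tokens")
```

At full utilization the cost per token is modest; at 1% utilization it is a hundred times higher, which is the sense in which a barely-used giant model can’t be reasonably priced.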