The DeepSeek report you cite is over a year old and seems to imply they quantized most (all?) weights to FP8, but the SemiAnalysis post states that by February 2026 “most frontier labs and inference providers are not running FP8”, especially since Nvidia introduced native support for three 4-bit float formats on Blackwell last year. A cluster of 8 B200s has 1.5 TB of HBM3e and fits at least 2T of 4-bit params, maybe more. Hypothetically (a rough estimate, given the questions in brackets that follow), think of 2200A110B plus 400 GB left for KV cache (or specifically for “hot” cache, if it’s possible to offload part of it to DRAM; is it? Do frontier labs use MLA? At least they don’t in their open-weights releases), or less if some params are in FP8 or BF16.
Reading your first comment again, you write: “Opus 4 probably targets Trainium 2 Ultra (6 TB of HBM per rack), so might be a 3T total param model”. Did you assume a whopping 3 TB for KV cache? BTW I agree that Claude 4-series were designed for FP8 inference, and perhaps GPT-5-series as well.
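To make the arithmetic behind both figures explicit, here is a minimal sketch. It assumes a single copy of the weights at one uniform precision and ignores activations, quantization scales, and any replication; the function names are mine, and the HBM totals are the ones quoted above.

```python
def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Memory for one copy of the weights, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


def kv_budget_gb(hbm_gb: float, params_billions: float, bits_per_param: float) -> float:
    """HBM left over for KV cache after storing one copy of the weights."""
    return hbm_gb - weight_gb(params_billions, bits_per_param)


# 8 x B200 with the 1.5 TB figure quoted above; 2200B params at 4 bits.
b200_left = kv_budget_gb(1500, 2200, 4)

# Trainium 2 Ultra rack with 6 TB of HBM; 3T params at 8 bits (FP8).
trn2_left = kv_budget_gb(6000, 3000, 8)

print(f"8 x B200:      {b200_left:.0f} GB left for KV cache")
print(f"Trn2 Ultra:    {trn2_left:.0f} GB left for KV cache")
```

Under these assumptions the first case leaves 400 GB and the second leaves 3000 GB, which is where the “3 TB on KV cache” question comes from.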
The aforementioned Kimi K2-series models are actually quantized into mixed precision (experts in INT4, everything else in BF16) natively (with QAT) and have a practical memory footprint, according to different sources, of around 550-600 GB, hence they should fit onto an 8 x H100/H800 cluster. 500B+ models aren’t going to run on 640 GB in 8 bits indeed, but I suppose Z.ai wouldn’t increase the size of GLM-5 to 744A40B if they couldn’t inference it economically (perhaps they also use 4-bit quants? Or do they expect their clients to deploy it on B200s?). Anyway, quite a few other models are even smaller, such as Minimax M2-series with 230A10B; refer to S. Raschka’s blog for further comparison. GPT-OSS 120B (117A5B) is clearly not limited in that way and still enjoys the ~1:20 sparsity.
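For reference, the sparsity ratios implied by the `<total>A<active>B` tags used in this thread can be checked with a short sketch. The helper names are hypothetical, and the fit check counts only one copy of the weights at a uniform precision (no KV cache, no BF16 non-expert parts), so it is an optimistic bound.

```python
import re


def parse_moe(tag: str) -> tuple[float, float]:
    """Split a tag like '744A40B' into (total, active) params in billions."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)A(\d+(?:\.\d+)?)B", tag)
    if m is None:
        raise ValueError(f"not a <total>A<active>B tag: {tag!r}")
    return float(m.group(1)), float(m.group(2))


def sparsity(tag: str) -> float:
    """Total-to-active parameter ratio, i.e. the N in '1:N'."""
    total, active = parse_moe(tag)
    return total / active


def weights_fit(tag: str, bits: float, hbm_gb: float = 640) -> bool:
    """Whether one copy of the weights alone fits in hbm_gb (8 x 80 GB by default)."""
    total, _ = parse_moe(tag)
    return total * bits / 8 <= hbm_gb


for name, tag in [
    ("GLM-5", "744A40B"),
    ("Minimax M2", "230A10B"),
    ("GPT-OSS 120B", "117A5B"),
    ("hypothetical 8 x B200 model", "2200A110B"),
]:
    print(f"{name}: 1:{sparsity(tag):.1f}, "
          f"8-bit fits 640 GB: {weights_fit(tag, 8)}, "
          f"4-bit fits 640 GB: {weights_fit(tag, 4)}")
```

All four land between roughly 1:19 and 1:23, and GLM-5 only passes the 640 GB check at 4 bits.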
And since you mentioned Musk anyway, he also disclosed recently that Grok 4.20 is 500B total params and estimated Sonnet to be ~1T and Opus to be ~5T.
Output tokens are bandwidth constrained, and the cost is divided among requests in a batch. So as we increase the number of requests in a batch, the cost goes down a lot while more HBM is used for the model weights than for the total KV cache. As the model gets smaller than the total KV cache of all requests, the cost starts being governed by the size of KV cache for a given request. So half-and-half is a rule of thumb I’m using for the upper bound of “efficient” (decode) deployment. It’s even better if it’s less than half.
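A toy model of this rule of thumb, as a sketch rather than a real cost model: it assumes every decode step reads all the weights once for the whole batch plus each request’s KV cache, that cost is proportional to bytes read from HBM, and that the batch fills whatever HBM the weights leave free. The 1 GB per-request KV cache and the 1500 GB node are hypothetical numbers chosen for round results.

```python
def gb_read_per_token(weights_gb: float, kv_per_request_gb: float, batch: int) -> float:
    """HBM traffic attributable to one generated token of one request, in GB.

    The weights are read once per step and shared by the batch;
    each request's own KV cache is read in full.
    """
    return weights_gb / batch + kv_per_request_gb


def max_batch(hbm_gb: float, weights_gb: float, kv_per_request_gb: float) -> int:
    """Largest batch whose total KV cache fits next to the weights."""
    return int((hbm_gb - weights_gb) // kv_per_request_gb)


HBM_GB = 1500.0            # hypothetical node
KV_PER_REQUEST_GB = 1.0    # hypothetical per-request KV cache

for weights_gb in (375.0, 750.0, 1125.0):  # weights take 1/4, 1/2, 3/4 of HBM
    batch = max_batch(HBM_GB, weights_gb, KV_PER_REQUEST_GB)
    cost = gb_read_per_token(weights_gb, KV_PER_REQUEST_GB, batch)
    # The floor is KV_PER_REQUEST_GB: what a token costs with an unbounded batch.
    print(f"weights {weights_gb / HBM_GB:.0%} of HBM: batch {batch}, "
          f"{cost / KV_PER_REQUEST_GB:.2f}x the KV-only floor")
```

In this toy setup, weights at half of HBM cost 2x the KV-only floor per token, a quarter costs about 1.33x, and three quarters costs 4x, which is the sense in which half-and-half marks the edge of the efficient regime.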
Most weights of a model are in FFNs, but the non-FFN weights (and shared FFNs) might need to be replicated a bunch of times within the same deployment. So precision of FFNs is the most significant factor of how much space a model takes up (the other parts are often in different precisions), but the weights in a deployment take up more in total than a single copy of the weights. I did assume FP8 FFNs as a target for the largest models (where quality is at a premium, with many other tradeoffs also paying for quality with performance and cost), as a conservative assumption. For smaller models, FP4 might be more prevalent, particularly with the models that have to be squeezed into the 8-chip Nvidia servers and were trained for that with QAT.
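As an illustration of why the weights in a deployment exceed a single copy, here is a sketch with made-up numbers: a hypothetical 1000B model with 970B routed-FFN params and 30B of everything else, the non-FFN part kept in BF16 and replicated 8 times. The split and the replica count are assumptions for the example, not estimates of any real model.

```python
def deployed_weights_gb(
    ffn_params_b: float,
    ffn_bits: float,
    other_params_b: float,
    other_bits: float,
    replicas: int,
) -> float:
    """Total weight memory in GB when the non-FFN part is replicated.

    Routed FFN (expert) weights are stored once across the deployment;
    the remaining weights are stored `replicas` times.
    """
    ffn_gb = ffn_params_b * ffn_bits / 8
    other_gb = other_params_b * other_bits / 8 * replicas
    return ffn_gb + other_gb


# Hypothetical split: 970B FFN params, 30B everything else (BF16 = 16 bits).
single_copy_fp8 = deployed_weights_gb(970, 8, 30, 16, replicas=1)
deployed_fp8 = deployed_weights_gb(970, 8, 30, 16, replicas=8)
deployed_fp4 = deployed_weights_gb(970, 4, 30, 16, replicas=8)

print(f"single copy, FP8 FFNs: {single_copy_fp8:.0f} GB")
print(f"8 replicas,  FP8 FFNs: {deployed_fp8:.0f} GB")
print(f"8 replicas,  FP4 FFNs: {deployed_fp4:.0f} GB")
```

With these numbers, replication adds 420 GB on top of the single FP8 copy, while dropping the FFNs from FP8 to FP4 saves 485 GB, so FFN precision remains the larger lever.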
“I suppose Z.ai wouldn’t increase the size of GLM-5 to 744A40B if they couldn’t inference it economically”
It’s economical to even pull DeepSeek-V3 for the smaller models and do expert parallelism across dozens of nodes in a single deployment. It’s just much worse than if you don’t have to, and less economical if the model has 15x more total params and 30x more active params (which also makes the activation vectors heavier). None of these arguments are about hard caps on how deployments have to happen, except for how sufficiently sluggish performance makes it impossible to do good RLVR training for a model that’s too large for the available hardware (because then you won’t be able to fit enough training steps in 3 months or however long it needs to take), even if there’s quite a lot of that hardware. The incentives probably point to targeting in-principle efficient deployment when possible, and to come what may when not.
Don’t you see a contradiction between your earlier line of argument “OpenAI should have designed GPT-5 to fit on 8 H100s in FP8 despite gradually getting B200s with more HBM and FP4 support” and “one can deploy across several nodes, it’s just less effective”? Selecting the size of a future model for the hardware you are getting, rather than the hardware that dominates your fleet right now, seems the wiser choice in light of your latter thesis.
Also, checking EpochAI’s free estimates, Microsoft had roughly as much Hopper compute as Blackwell in Q2 2025, and B200+B300 overtook H100s in Q3. What makes you think H100s were more important to OpenAI in August 2025 than Blackwell? Do you have access to better estimates (maybe SemiAnalysis)?
OpenAI’s “largest” models are not the main GPT-5-series, but GPT-5/5.2/5.4 Pro, and I see no downsides in quantizing the ones served to free users as hard as they can. There’s a relatively widespread belief that Pros are based on the same base models as Chat, just with different post-training, larger quants and inference scaling. Do you share that belief?
Many people also place GPT-5 between Sonnet 4 and Opus 4 in size, partially based on performance (both benchmarks and vibes), partially on price and partially on the fact that Pros are somewhat competitive with Opus. What about you?
After thinking more about it, I personally still believe GPT-5-series are all around 2000A100B: GPT-5 in FP8 (here and below, formats refer to the experts), GPT-5 Pro in BF16, GPT-5.4 in FP4 with QAT (seems to fit well onto 8xB200 with 128k context), GPT-5.4 Pro in FP8. Oh, and BTW, I realized Huang’s 2T couldn’t have been a holdover from GPT-4 because two years ago he actually mentioned “GPT-MoE-1.8T”.
As for Anthropic models, I would place Opus 4 at about double GPT-5, and Sonnet 4 somewhere in between GPT-5 and Kimi K2 (say, around 1200A60B). Not sure about quantization, since Amazon hardware will only natively support FP4 from Trainium 4, but INT4 is always an option.
Added at the last moment: I briefly looked up scaling law research on sparsity, and Abnar et al. 2025 appear to indicate that the larger a MoE, the more beneficial sparsity becomes. I haven’t had time yet to read it carefully (Friday, you know), but it seems to support the empirical trend toward increasing sparsity.
It’s not a choice of B200 vs. H100 when there’s a general compute shortage, the flagship model has to be able to run on everything (since you wouldn’t be sure about the demand in advance), and profitably at the prices you’re setting. I’m assuming, maybe incorrectly, that the smaller models (than the largest/flagship “GPT-5” model) don’t predictably-in-advance use more than half of all compute (in other words, wouldn’t bring in more than half of all revenue, if counterfactually all tokens were served via API at API prices with similar-to-each-other margins).
The argument is about incentives. A 2x larger model is meaningfully but not materially smarter, but if it breaks out of an efficient serving regime, then it’s suddenly 3x more expensive (and needs 3x as much serving capacity across your datacenters, if demand stays the same), so you try to avoid getting there. On the other hand, if you are already in the inefficient serving regime (as many Chinese models are, and possibly GPT-5 in 2025), or you have to get there to reach the capabilities you need, then making a model even larger will be less impactful, so you might be willing to do that. But the benefit might also be marginal, so it’s a wash, reasonable decisions will point in either direction. Conversely, if new hardware is coming out that will let you get back into an efficient regime with a model that you previously had to serve inefficiently, maybe you’ll cut corners to get it done.
For GPT-5 in 2026, my guess is that GPT-5.4 (and maybe 5.3) is a new pretrain, and in 2026 it’s more plausible to rely exclusively on B200s for the flagship model, leaving H100s and maybe H200s for the smaller models. So maybe GPT-5.4 is actually larger (in total params) than GPT-5.2/GPT-5.0/o3/GPT-4o, which is what I’m estimating at 300B-900B total params, based on the quality of Chinese models (what they’re demonstrating to be possible with the model sizes they’re using), availability of pretraining compute to OpenAI (thus the feasibility of using more active params than the Chinese models, and the expectation that OpenAI can get the same quality into smaller models), and the need to be able to serve models on H100/A100 servers. On the 300B side is the hypothetical model that does fit on H100/A100 servers, but 300B might be too small for the model to be good enough. On the 900B side is essentially the Chinese consensus that the available servers are too small and you have to give up on high efficiency to make a good model. Here, the constraint is that I don’t expect an OpenAI model of GPT-5.0 to GPT-5.2 quality needs to be larger than 900B total params.
“the flagship model has to be able to run on everything (since you wouldn’t be sure about the demand in advance)”
I disagree. EpochAI gives the ratio of Blackwell to Hopper at Azure in Q3 2025 as 5:4; let’s take it at face value for a moment and ignore MI300X, A100, etc. Do you really believe that R&D experiments, training of Sora 2, and inference of GPT-Image-1, o1, o3, and 4o, together with all the mini models, didn’t combine to ~4/9 of compute? If you check the 3rd-party websites, they all show more usage of 4-series models than of GPT-5 as late as September. I believe OpenAI could predict how much demand there would be for GPT-5, infer that it wouldn’t consume significantly over half of their compute, and reroute free users to cheaper models if there was too much demand.
You have not answered my question in the third paragraph of the comment above: how large is Sonnet 4? As for your silence on some other questions/topics, I guess, sapienti sat.
These are some solid contributions to the seemingly rare critiques of Vladimir’s interpretations of otherwise fantastically aggregated public data, but I must say there is still some interesting jaggedness to it. For example, I’m surprised you are not aware of Pro being primarily consensus-based (e.g. sampling 10x instances per token into consensus via a mechanism similar to speculative decoding). Plenty of public reporting on it across the usual suspects. Plus you can more or less implement it yourself with local models and see the exact same performance-improvement and throughput-reduction curves!
To whack-a-mole another: you correctly call out how early Blackwell was both targeted for deployment and deployed in volume, but still manage to miss the forest for the trees on how critical native Q4 is for everything inference. Inference-time scaling (ITS) is king, across both RL and deployed-model performance. In other words, inferencing more tokens, faster, annihilates any improvement you get from running Q8+ on Q4-native hardware, and all modern lab deployments + mid/post trainings are running Q4 (apart from the usual caveats) wherever possible. (With total/active size delicately balanced between the ITS, parameter-scaling, and cost trifecta.)
TBH I have read but forgotten the details of the implementation you describe. In this discussion it was only relevant whether the size is the same, so I didn’t bother to remember or research it anew.
As for quantization, maybe you are right, I haven’t investigated it in detail! I don’t think it’s a crux of our disagreement with Vladimir anyway.