Why do you think the frontier models still retain the sparsity level of GPT-4 (roughly 1:8 active to total params) when open-weight models have gone much sparser, with Kimi K2 at ~1:30 and most of the others hovering around 1:20?
P. S.
After posting the comment above, I remembered that Jensen Huang discussed a 2T-total-param “GPT-MoE” with either a 128k- or a 400k-token context window in his NVidia GTC 2026 presentation last month: https://2slides.com/gallery/nvidia-gtc-2026-keynote-deck-jensen-huang-ai-factory-vision (slide 32 onward).[1] That matches the GPT-5-series context lengths in ChatGPT and over the API respectively, which seems like quite strong evidence against the model sizes you suggest!
[1] Not the first time he has disclosed that kind of info, BTW: https://www.reddit.com/r/singularity/comments/1bi8rme/jensen_huang_just_gave_us_some_numbers_for_the
The DeepSeek report you cite is over a year old and seems to imply they quantized most (all?) weights to FP8, but the SemiAnalysis post states that by February 2026 “most frontier labs and inference providers are not running FP8”, especially since NVidia introduced native support for three 4-bit float formats on Blackwell last year. A cluster of 8 B200s has 1.5 TB of HBM3e and fits at least 2T params at 4 bits, maybe more. Hypothetically (a rough estimate, given the open questions in the brackets that follow), think of a 2200A110B model with ~400 GB left for the KV cache (or specifically for the “hot” part of the cache, if part of it can be offloaded to DRAM; is that possible? Do frontier labs use MLA? They don’t in their open-weights releases, at least), or less if some params are kept in FP8 or BF16.
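To make the arithmetic behind that explicit (all figures here are my own rough assumptions, not anyone's disclosed deployment):

```python
# Back-of-the-envelope HBM budget for the hypothetical 8 x B200 setup above.
# Assumptions: 192 GB HBM3e per B200, a 2200A110B model with all weights in 4-bit.
def kv_cache_budget_gb(hbm_gb: float, total_params_b: float, bytes_per_param: float) -> float:
    """HBM left over for KV cache after loading the weights, in (decimal) GB."""
    return hbm_gb - total_params_b * bytes_per_param

cluster_hbm = 8 * 192                                   # ~1.5 TB across the node
leftover = kv_cache_budget_gb(cluster_hbm, 2200, 0.5)   # 4-bit ~= 0.5 bytes/param
print(f"weights: {2200 * 0.5:.0f} GB, KV cache budget: {leftover:.0f} GB")
# -> weights: 1100 GB, KV cache budget: 436 GB (i.e. the "~400 GB left" above)
```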
Rereading your first comment, you write: “Opus 4 probably targets Trainium 2 Ultra (6 TB of HBM per rack), so might be a 3T total param model”. Did you assume a whopping 3 TB for the KV cache? BTW, I agree that the Claude 4 series was designed for FP8 inference, and perhaps the GPT-5 series as well.
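For comparison, the same kind of estimate applied to your Trainium 2 Ultra scenario (again, just my reading of your numbers: 6 TB of HBM per rack and a 3T-param model in FP8), which is where my 3 TB question comes from:

```python
# Hypothetical Opus-4-on-Trainium-2-Ultra budget: 6 TB of HBM, 3T params at 1 byte each (FP8).
rack_hbm_gb = 6000
weights_gb = 3000 * 1.0                                 # 3 TB of weights
print(f"{rack_hbm_gb - weights_gb:.0f} GB left for KV cache")   # -> 3000 GB
```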
The aforementioned Kimi K2-series models are actually released in mixed precision natively (experts in INT4 via QAT, everything else in BF16) and, according to different sources, have a practical memory footprint of around 550-600 GB, so they should fit onto an 8 x H100/H800 node. 500B+ models indeed aren’t going to run on 640 GB at 8 bits, but I suppose Z.ai wouldn’t have grown GLM-5 to 744A40B if they couldn’t serve it economically (perhaps they also use 4-bit quants? Or do they expect their clients to deploy it on B200s?). Anyway, quite a few other models are even smaller, such as the Minimax M2 series at 230A10B; see S. Raschka’s blog for a broader comparison. GPT-OSS 120B (117A5B) is clearly not constrained in that way and still enjoys ~1:20 sparsity.
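A rough sanity check on that K2 footprint (the expert vs. non-expert split below is my own guess, roughly 1T of expert params and a few tens of billions outside the experts, not an official breakdown):

```python
# Mixed-precision footprint estimate for a K2-like ~1T-total model:
# experts quantized to INT4 (0.5 bytes/param), the rest kept in BF16 (2 bytes/param).
expert_params_b = 1000    # assumed expert params, in billions
other_params_b = 40       # assumed attention / embeddings / shared params, in billions

footprint_gb = expert_params_b * 0.5 + other_params_b * 2.0
print(f"~{footprint_gb:.0f} GB")  # -> ~580 GB, consistent with the 550-600 GB figures
                                  # and under the 640 GB of an 8 x H100/H800 node
```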
And since you mentioned Musk anyway, he also disclosed recently that Grok 4.20 is 500B total params and estimated Sonnet to be ~1T and Opus to be ~5T.