And already on February 17th, Colossus had 150k+ GPUs. The April message seems to refer to 200k GPUs. Judging by Musk’s interview, this could mean 150,000 H100s and 50,000 H200s.
Perhaps the time and GPUs were enough to train a GPT-5 scale model?
The 10x Grok 2 claims weakly suggest 3e26 FLOPs rather than 6e26 FLOPs. The same opening slide of the Grok 4 livestream claims parity between Grok 3 and Grok 4 pretraining, and Grok 3 didn’t have more than 100K H100s to work with. API prices for Grok 3 and Grok 4 are also the same and relatively low ($3/$15 per 1M input/output tokens), so they might even be using the same pretrained model (or in any case a similarly-sized one).
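As a rough sanity check on these numbers, here is a minimal sketch of how long 100K H100s would need to run to reach each FLOPs figure. The 40% utilization is purely an assumption for illustration (nothing in the thread states it), and the per-GPU peak is Nvidia's published dense BF16 figure for the H100 SXM; running in FP8 would roughly double the peak and halve the days.

```python
# Back-of-the-envelope: training days needed to reach a given FLOPs budget.
H100_BF16_DENSE_FLOPS = 989e12  # vendor peak, dense BF16, FLOP/s per H100 SXM
ASSUMED_UTILIZATION = 0.4       # assumption: fraction of peak actually sustained
SECONDS_PER_DAY = 86_400


def training_days(total_flops, num_gpus,
                  peak_flops=H100_BF16_DENSE_FLOPS,
                  utilization=ASSUMED_UTILIZATION):
    """Days for num_gpus GPUs to deliver total_flops at the given utilization."""
    cluster_flops_per_second = num_gpus * peak_flops * utilization
    return total_flops / cluster_flops_per_second / SECONDS_PER_DAY


days_3e26 = training_days(3e26, 100_000)
days_6e26 = training_days(6e26, 100_000)
print(f"3e26 FLOPs on 100K H100s: {days_3e26:.0f} days")
print(f"6e26 FLOPs on 100K H100s: {days_6e26:.0f} days")
```

Under these assumptions 3e26 FLOPs takes roughly three months and 6e26 roughly six, so the lower figure fits more easily into the time Grok 3 had on 100K H100s.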
Since Grok 3 has been in use since early 2025, before GB200 NVL72 systems were available in sufficient numbers, it needs to be a smaller model than would be compute optimal for the compute of 100K H100s. At 1:8 MoE sparsity (active:total params), it’s compute optimal to have about 7T total params at 5e26 FLOPs, which in FP8 comfortably fits in one GB200 NVL72 rack (which has 13TB of HBM). So in principle a compute optimal system could be deployed right now, even in a reasoning form, but it would still cost more, and it would need more GB200s than xAI seems to have to spare currently (even the near-future GB200s will be needed more urgently for RLVR, if the above RLVR scaling interpretation of Grok 4 is correct).
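The arithmetic behind the 7T-parameter figure can be checked with a short sketch. It assumes the common approximation that training compute is about 6 × active params × tokens, and that FP8 weights take one byte per parameter; both are assumptions added here, not something stated above, and KV cache and activations are ignored.

```python
# Rough check: does a 7T-param FP8 model fit in one GB200 NVL72 rack,
# and what token count does 5e26 FLOPs imply at 1:8 sparsity?
TOTAL_PARAMS = 7e12        # total MoE params, from the comment above
ACTIVE_FRACTION = 1 / 8    # 1:8 active:total sparsity
COMPUTE_FLOPS = 5e26       # training compute, from the comment above
BYTES_PER_PARAM_FP8 = 1    # assumption: 1 byte per weight in FP8
RACK_HBM_BYTES = 13e12     # 13TB of HBM per rack, from the comment above

active_params = TOTAL_PARAMS * ACTIVE_FRACTION
weights_bytes = TOTAL_PARAMS * BYTES_PER_PARAM_FP8
hbm_headroom_bytes = RACK_HBM_BYTES - weights_bytes

# Assumption: compute ~= 6 * active_params * tokens (standard approximation).
implied_tokens = COMPUTE_FLOPS / (6 * active_params)
tokens_per_active_param = implied_tokens / active_params

print(f"active params: {active_params:.3g}")
print(f"FP8 weights: {weights_bytes / 1e12:.0f} TB, "
      f"headroom: {hbm_headroom_bytes / 1e12:.0f} TB")
print(f"implied tokens: {implied_tokens:.3g} "
      f"({tokens_per_active_param:.0f} per active param)")
```

The weights take about 7TB of the 13TB, leaving roughly 6TB for everything else, which is the sense in which the model fits comfortably in a single rack.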
Or 6e26 (in FP8 FLOPs).