Grok 4 is not just plausibly SOTA: the opening slide of its livestream presentation (at 2:29 in the video) somewhat ambiguously suggests that Grok 4 used as much compute for RLVR training as for pretraining, and the pretraining is itself at the frontier level (100K H100s, plausibly about 3e26 FLOPs). This amount of RLVR scaling has never been claimed before (it’s not being claimed very clearly here either, but it is what the literal interpretation of the slide says; also, the implied compute parity is almost certainly in terms of GPU-time, not FLOPs).
Thus it’s plausibly a substantially new kind of model, not just a well-known kind of model with SOTA capabilities, and so it could be unusually impactful to study its safety properties.
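For reference, a rough back-of-envelope for the pretraining figure: the ~3e26 FLOPs estimate is what 100K H100s produce over roughly three months at typical utilization. A minimal sketch, where the run length and the ~40% utilization are my assumptions rather than anything stated on the slide:

```python
# Rough sanity check of the ~3e26 FLOPs pretraining estimate (assumed numbers).
gpus = 100_000                  # Colossus-scale H100 cluster
bf16_peak = 0.989e15            # ~989 TFLOP/s dense BF16 per H100
fp8_peak = 2 * bf16_peak        # FP8 peak is roughly double BF16
utilization = 0.4               # assumed model FLOPs utilization, ~40%
seconds = 90 * 24 * 3600        # assumed ~3-month run

bf16_flops = gpus * bf16_peak * utilization * seconds
fp8_flops = gpus * fp8_peak * utilization * seconds
print(f"{bf16_flops:.1e} BF16 FLOPs")   # ~3.1e26, i.e. "about 3e26"
print(f"{fp8_flops:.1e} FP8 FLOPs")     # ~6.2e26, i.e. "or 6e26 in FP8 FLOPs"
```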
Another takeaway from the livestream is the following bit of AI risk attitude Musk shared (at 14:29):

It’s somewhat unnerving to have intelligence created that is far greater than our own. And it’ll be bad or good for humanity. I think it’ll be good, most likely it’ll be good. But I’ve somewhat reconciled myself to the fact that even if it wasn’t gonna be good, I’d at least like to be alive to see it happen.
What do you think of the argument that Grok 4 used only ~1/5th as much compute for RLVR as for pretraining (~3e26 FLOPs of pretraining + ~6e25 FLOPs of RLVR)? https://x.com/tmychow/status/1943460487565578534
RLVR involves decoding (generating) 10K-50K token long sequences, so its compute utilization is much worse than pretraining’s, especially on H100/H200 if the whole model doesn’t fit in one node (the scale-up world). The usual difference between input and output token prices reflects this, since processing of input tokens (prefill) is algorithmically closer to pretraining, while processing of output tokens (decoding) is closer to RLVR.
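To see why decoding utilizes compute so much worse, here is a minimal arithmetic-intensity sketch. The model size and batch sizes are illustrative assumptions; the H100 figures are approximate public specs:

```python
# Why decode (RLVR generation) tends to be memory-bound while prefill is compute-bound.
# Per generated token, each sequence needs ~2 FLOPs per active parameter, and the
# weights must be streamed from HBM once per decoding step.
active_params = 300e9           # illustrative active parameter count (not Grok 4's)
bytes_per_param = 1             # FP8 weights
hbm_bandwidth = 3.35e12         # H100 SXM, ~3.35 TB/s
bf16_peak = 0.989e15            # ~989 TFLOP/s dense BF16

# FLOPs per byte the GPU can sustain before HBM bandwidth becomes the bottleneck:
gpu_intensity = bf16_peak / hbm_bandwidth          # ~295 FLOPs/byte

def decode_intensity(batch_size: int) -> float:
    """FLOPs per byte of weight traffic for one decoding step (ignores KV-cache
    reads, which make decode even more memory-bound)."""
    flops = 2 * active_params * batch_size
    bytes_moved = active_params * bytes_per_param  # weights read once per step
    return flops / bytes_moved                     # = 2 * batch_size / bytes_per_param

for batch in (1, 32, 256):
    bound = "compute" if decode_intensity(batch) >= gpu_intensity else "memory"
    print(f"batch {batch}: {decode_intensity(batch):.0f} FLOPs/byte -> {bound}-bound")

# Prefill processes a whole prompt per weight read, so its intensity is roughly
# 2 * prompt_length FLOPs/byte (thousands), which is one reason input tokens
# are priced well below output tokens.
```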
The 1:5 ratio between input and output token prices is somewhat common in APIs (it’s this way for Grok 3 and Grok 4), and it might reflect the ratio in compute utilization, since the API provider pays for GPU-time rather than for the compute actually utilized. So if Grok 4 used the same total GPU-time for RLVR as it used for pretraining (such as 3 months on 100K H100s), it might’ve used about 5 times fewer FLOPs in the process. This is what I meant by “compute parity is in terms of GPU-time, not FLOPs” in the comment above.
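A minimal sketch of that distinction, reusing the cluster and run-length assumptions from the earlier estimate and taking the 1:5 utilization ratio at face value:

```python
# Equal GPU-time for pretraining and RLVR does not mean equal FLOPs
# if RLVR runs at ~1/5th the utilization (assumed from the 1:5 price ratio).
gpus = 100_000
bf16_peak = 0.989e15            # ~989 TFLOP/s dense BF16 per H100
seconds = 90 * 24 * 3600        # assumed ~3 months of GPU-time for each phase

pretrain_util = 0.40            # assumed pretraining utilization
rlvr_util = pretrain_util / 5   # ~8%, if the price ratio tracks utilization

pretrain_flops = gpus * bf16_peak * pretrain_util * seconds   # ~3e26
rlvr_flops = gpus * bf16_peak * rlvr_util * seconds           # ~6e25
print(f"pretraining: {pretrain_flops:.1e} FLOPs")
print(f"RLVR:        {rlvr_flops:.1e} FLOPs ({rlvr_flops / pretrain_flops:.0%} of pretraining)")
```

On these assumptions, GPU-time parity works out to roughly the ~3e26 + ~6e25 split from the question above.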
GB200 NVL72 (13TB of HBM) will improve utilization during RLVR for large models that don’t fit in an H200 NVL8 (1.1TB) or B200 NVL8 (1.4TB) node with room to spare for KV cache, which likely describes all of the 2025 frontier models. So this opens the possibility of both doing a lot of RLVR in reasonable time for even larger models (such as compute-optimal models at 5e26 FLOPs) and using longer reasoning traces than the current 10K-50K tokens.
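A rough sketch of the memory pressure involved; every dimension below is an illustrative guess, not anything known about Grok 4:

```python
# Why RLVR decoding of long traces wants a bigger scale-up domain than an NVL8 node.
total_params = 3e12             # assumed total (MoE) parameter count
bytes_per_param = 1             # FP8 weights
layers = 100                    # assumed
kv_heads = 8                    # assumed GQA key/value heads
head_dim = 128                  # assumed
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 1   # K and V, FP8 (~0.2 MB)

weights_tb = total_params * bytes_per_param / 1e12

def kv_cache_tb(batch: int, trace_len: int) -> float:
    return batch * trace_len * kv_bytes_per_token / 1e12

batch = 512                     # assumed concurrent rollouts per model replica
for trace_len in (10_000, 50_000):
    total = weights_tb + kv_cache_tb(batch, trace_len)
    print(f"{trace_len}-token traces, batch {batch}: {total:.1f} TB of HBM needed "
          f"(NVL8 node: 1.1-1.4 TB, GB200 NVL72 rack: 13 TB)")
```

Even with these generous assumptions, the weights plus KV cache overflow an NVL8 node while fitting comfortably in an NVL72 rack.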
Or 6e26 (in FP8 FLOPs).
Already on February 17th, Colossus had 150K+ GPUs. It seems that in the April message they were talking about 200K GPUs. Judging by Musk’s interview, this could mean 150,000 H100s and 50,000 H200s. Perhaps the time and the GPUs were enough to train a GPT-5 scale model?
The “10x Grok 2” claims weakly suggest 3e26 FLOPs rather than 6e26 FLOPs. The same opening slide of the Grok 4 livestream claims parity between Grok 3 and Grok 4 pretraining, and Grok 3 didn’t have more than 100K H100s to work with. API prices for Grok 3 and Grok 4 are also the same and relatively low ($3/$15 per 1M input/output tokens), so they might even be using the same pretrained model (or in any case a similarly sized one).
Grok 3 has been in use since early 2025, before GB200 NVL72 systems were available in sufficient numbers, so it needs to be a smaller model than what is compute optimal for 100K H100s’ worth of compute. At 1:8 MoE sparsity (active:total params), it’s compute optimal to have about 7T total params at 5e26 FLOPs, which in FP8 comfortably fits in one GB200 NVL72 rack (which has 13TB of HBM). So in principle a compute-optimal system could be deployed right now, even in a reasoning form, but it would cost more, and it would need more GB200s than xAI currently seems to have to spare (even the near-future GB200s would be needed more urgently for RLVR, if the above RLVR scaling interpretation of Grok 4 is correct).
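Checking the fit claimed above (the 7T total-parameter figure is from the comment; the headroom interpretation is mine):

```python
# Does a ~7T-total-parameter MoE model in FP8 fit in one GB200 NVL72 rack?
total_params = 7e12
bytes_per_param = 1             # FP8
nvl72_hbm_tb = 13               # ~13 TB of HBM per GB200 NVL72 rack

weights_tb = total_params * bytes_per_param / 1e12   # 7 TB of weights
headroom_tb = nvl72_hbm_tb - weights_tb              # ~6 TB left over
print(f"weights: {weights_tb:.0f} TB, headroom for KV cache/activations: {headroom_tb:.0f} TB")
# ~6 TB of headroom is the sense in which a compute-optimal ~5e26-FLOPs model
# "comfortably fits" in one rack for serving.
```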