The concrete news is a new $6 billion round, which enables xAI to follow through on the intention to add another 100K H100s (or a mix of H100s and H200s) to the existing 100K H100s. The timeline for a million GPUs remains unknown (and the means of powering them at that facility even more so).
Going fast with 1M H100s might be a bad idea if the problem with large minibatch sizes I hypothesize is real: that very large minibatch sizes both hurt training and are hard to avoid in practice when spreading a run across that many H100s. (This could even be the reason for the underwhelming scaling outcomes of the current wave of scaling, if those are real too, though not for Google.)
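To make the shape of that concern concrete, here is a back-of-envelope sketch; the parallelism split, sequence length, and per-replica batch below are illustrative assumptions of mine, not figures from any actual training run.

```python
# Back-of-envelope: why the global minibatch tends to balloon across ~100K H100s.
# With 8-GPU NVLink scale-up domains, tensor parallelism is usually capped at 8-way,
# so most of the remaining parallelism has to come from data parallelism.
# All concrete numbers below are assumptions for illustration.

num_gpus = 100_000           # total H100s
tensor_parallel = 8          # limited by the 8-GPU scale-up domain
pipeline_parallel = 16       # assumed pipeline depth
data_parallel = num_gpus // (tensor_parallel * pipeline_parallel)  # ~781 replicas

seqs_per_replica = 4         # assumed sequences per replica per optimizer step
seq_len = 8_192              # assumed sequence length in tokens

global_minibatch_tokens = data_parallel * seqs_per_replica * seq_len
print(f"{data_parallel} data-parallel replicas -> "
      f"~{global_minibatch_tokens / 1e6:.0f}M tokens per optimizer step")
# ~26M tokens per step with these assumptions. Even at 1 sequence per replica the
# floor is data_parallel * seq_len ≈ 6M tokens; going lower means fewer replicas,
# i.e. idling GPUs, which is the "hard to avoid in practice" part.
```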
Aiming for 1M B200s only doubles or triples Microsoft’s planned 300K-700K B200s, so it’s not a decisive advantage and even less meaningful without a timeline (at some point Microsoft could be doubling or tripling training compute as well).
For the next few months Anthropic might have the compute lead (over OpenAI, Meta, xAI; Google is harder to guess). And if the Rainier cluster uses Trn2 Ultra rather than regular Trn2, there won't even be a minibatch size problem there (if the problem is real): unlike H100s, which form 8-GPU scale-up domains, the Trn2 Ultra machines have 64-chip scale-up domains, or about 41 units of H100-equivalent compute per scale-up domain.
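For the arithmetic behind that 41 figure, here is a rough check; the per-chip dense BF16 peak numbers are my assumptions from public spec sheets, not something stated here.

```python
# Hedged arithmetic behind "~41 H100-equivalents per scale-up domain" for Trn2 Ultra.
# Peak-throughput figures are assumed approximations, not from the post.
h100_dense_bf16_tflops = 989    # H100 SXM dense BF16 tensor-core peak (assumed)
trn2_dense_bf16_tflops = 650    # Trainium2 dense BF16 peak (assumed, approximate)

trn2_ultra_domain_chips = 64    # Trn2 Ultra scale-up domain
h100_domain_gpus = 8            # H100 NVLink scale-up domain (one HGX node)

h100_equiv = trn2_ultra_domain_chips * trn2_dense_bf16_tflops / h100_dense_bf16_tflops
print(f"Trn2 Ultra domain ≈ {h100_equiv:.0f} H100-equivalents, "
      f"vs {h100_domain_gpus} H100s in an H100 scale-up domain")
# ≈ 42 with these assumed numbers, in line with the ~41 above: roughly a 5x larger
# scale-up domain in H100-equivalent terms, which is what would relax the
# minibatch-size constraint (if the problem is real).
```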