In 2024, there were multiple sightings of training systems at the scale of 100K H100s: Microsoft’s 3 buildings in Goodyear, Arizona; xAI’s Memphis cluster; and Meta’s training system for Llama 4. Such a system costs about $5bn, needs 150 MW, and can pretrain a 4e26 FLOPs model in 4 months.
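A rough sanity check on the 4e26 FLOPs figure, assuming ~989 TFLOP/s of dense BF16 per H100 and about 40% utilization (both numbers are my assumptions):

```python
# Back-of-the-envelope estimate; per-GPU throughput and MFU are assumed, not measured.
num_gpus = 100_000
peak_flops = 989e12           # H100 dense BF16, FLOP/s
mfu = 0.40                    # assumed model FLOPs utilization
seconds = 4 * 30 * 24 * 3600  # ~4 months

print(f"{num_gpus * peak_flops * mfu * seconds:.1e}")  # ~4.1e26 FLOPs
```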
Then there are Google’s 100K TPUv6e clusters and Amazon’s 400K Trn2 cluster. Performance of a TPUv6e in dense BF16 is close to that of an H100, while 400K Trn2 chips produce about as much compute as 250K H100s.
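The Trn2 comparison works out similarly, assuming roughly 0.65e15 FLOP/s of dense BF16 per Trn2 chip (again my assumption):

```python
# Hedged estimate: the per-chip BF16 throughput figures are assumptions.
trn2_bf16 = 0.65e15   # assumed Trn2 dense BF16, FLOP/s
h100_bf16 = 0.989e15  # H100 dense BF16, FLOP/s

print(f"{400_000 * trn2_bf16 / h100_bf16:,.0f} H100-equivalents")  # ~263,000, in the ballpark of 250K
```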
Anthropic might need more time than the other players to get its new hardware running, but Trn2 and TPUv6e also have an advantage over H100: larger scale-up domains that enable more tensor parallelism and smaller minibatch sizes. Minibatch size might be an issue when training on H100s at this scale[1], which could explain some scaling difficulties for labs other than Google (or, once the Trn2 cluster becomes useful later in 2025, Anthropic).
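To make the scale-up-domain point concrete, here is a toy illustration (all parallelism degrees and batch sizes below are made up by me, not taken from any lab): tensor parallelism is usually kept inside the scale-up domain, so a larger domain means fewer data-parallel replicas and therefore a smaller global minibatch at the same chip count.

```python
# Toy illustration: all configuration numbers are made-up assumptions.
def global_minibatch_tokens(num_chips, tensor_parallel, pipeline_parallel, tokens_per_replica):
    # Tensor parallelism stays within the scale-up domain; the remaining chips
    # are split between pipeline and data parallelism.
    dp_replicas = num_chips // (tensor_parallel * pipeline_parallel)
    # Every data-parallel replica contributes its own tokens to each optimizer step.
    return dp_replicas * tokens_per_replica

tokens = 4 * 4096  # 4 sequences of 4K tokens per replica per step (made up)
# 100K H100s, tensor parallelism capped at the 8-GPU NVLink domain:
print(global_minibatch_tokens(100_000, 8, 16, tokens))   # ~12.8M tokens per step
# 400K Trn2 with an assumed 64-chip scale-up domain:
print(global_minibatch_tokens(400_000, 64, 16, tokens))  # ~6.4M tokens, despite 4x the chips
```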
Do we know much about TPU and Trn2 performance at lower precision? I expect most training runs are using 4-8 bit precision by this point.
Are there any public signs that anyone is training 10B+ LLMs in a precision other than 16 bits? There are experiments specifically about precision on smaller LLMs, but they don’t seem to get adopted in practice for larger models, despite the obvious advantage of getting 2x the compute.
DeepSeek-V3 is one example, and SemiAnalysis has claimed that most labs use FP8.
DeepSeek-V3 might be the only example (and it’s from the future, released after I asked the question). Not sure if it generalizes to expecting more FP8 training, as it’s a MoE model with 257 experts and uses relatively small 7Kx2K matrices in its experts, while GPT-3-175B tested in FP8 in the Sep 2022 paper has much larger matrices, and that result wasn’t sufficient to promote widespread adoption (at least where it’s possible to observe).
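For a sense of the size gap (the GPT-3-175B dimensions are the published ones; the comparison itself is just my illustration):

```python
# Each DeepSeek-V3 expert weight matrix vs a GPT-3-175B FFN weight matrix (element counts).
deepseek_expert = 7168 * 2048      # ~14.7M elements per expert matrix
gpt3_ffn = 12288 * (4 * 12288)     # d_model=12288, FFN width 4*d_model: ~604M elements

print(gpt3_ffn / deepseek_expert)  # ~41x larger matrices in GPT-3's FFN layers
```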
On the other hand, if DeepSeek-V3 really is as good for its compute (4e24-6e24 FLOPs) as the benchmarks indicate, it might motivate more training with a huge number of smaller experts (it activates 8 experts per token, so the number of experts is even higher than one would expect from its ratio of total to active parameters). There was a Feb 2024 paper claiming 20x or higher compute multipliers for MoE models compared to dense (Figure 1b), appearing only if they activate a lot of experts per token, predicting 64 to be optimal at 1e24-1e25 FLOPs (the usual practice is to activate 2 experts). So DeepSeek-V3 weakly supports this surprising claim, though actual experimental results with more compute than that paper’s 3e19-4e20 FLOPs per datapoint would be better. The paper also predicts reduction in tokens per parameter with more compute (Table 2), reaching 8 tokens per active parameter at 5e25 FLOPs (in a MoE model with 4096 experts, 64 of which get activated per token). If this too is somehow correct, natural text data can be sufficient for 10 times more compute than with dense models.
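Two quick calculations behind this, using the published DeepSeek-V3 numbers (671B total parameters, 37B active, 256 routed experts, 8 routed experts active per token) and the C ≈ 6ND approximation; the rest is my own arithmetic:

```python
# Sketch of the arithmetic; config numbers are from the DeepSeek-V3 report, the rest is illustrative.

# 1) Expert count vs the total/active parameter ratio.
total_params, active_params = 671e9, 37e9
print(total_params / active_params)  # ~18x, yet routed experts alone give 256/8 = 32x,
                                     # since attention, embeddings and the shared expert are always active.

# 2) What "8 tokens per active parameter at 5e25 FLOPs" implies under C ~= 6*N*D.
C, tokens_per_param = 5e25, 8
N_active = (C / (6 * tokens_per_param)) ** 0.5
D = tokens_per_param * N_active
print(f"{N_active:.1e} active params, {D:.1e} tokens")  # ~1.0e12 active params, ~8.2e12 tokens
```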
This makes sense; I think you could be right. Llama 4 should give us more evidence on numerical precision and the scaling of experts.