Forgive me if this is a naive question, but what about distributed training runs? Any view on whether progress on that front will result in training runs larger than what you describe here?
PS thank you for the extremely useful analysis as always — I would 100% subscribe if you had a newsletter or something (with a strictly positive willingness-to-pay even!).
There doesn’t necessarily need to be algorithmic progress to get there; sufficient bandwidth enables traditional pretraining across multiple sites. But it might be difficult to ensure that bandwidth is available across geographically distributed sites on short notice if you aren’t already a well-established hyperscaler building near your older datacenter sites.
In 2028, a model targeting inference on Rubin Ultra NVL576 (150 TB of HBM in one scale-up world) might want to be a MoE model with 80 TB of total params (80T params if in FP8, 160T if in FP4). If training uses the same precision for gradients, that’s also 80 TB of gradients to exchange. If averaged gradients use more precision, this could be 2x-8x more data.
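To make the precision bookkeeping explicit, here's a minimal back-of-envelope sketch; the 80 TB parameter budget and the 2x-8x averaged-gradient multipliers are from the paragraph above, the rest is just unit conversion.

```python
# Param counts that fill an 80 TB budget at a given precision, and the
# resulting gradient-exchange volume. Figures follow the text above.

def params_for_budget(total_bytes, bytes_per_param):
    """How many params fit in a given total byte budget."""
    return total_bytes / bytes_per_param

PARAM_BUDGET_BYTES = 80e12  # 80 TB of total MoE params

print(params_for_budget(PARAM_BUDGET_BYTES, 1.0) / 1e12, "T params in FP8")  # ~80T
print(params_for_budget(PARAM_BUDGET_BYTES, 0.5) / 1e12, "T params in FP4")  # ~160T

# Gradients at training precision are another 80 TB per exchange;
# higher-precision averaged gradients scale that linearly.
for mult in (1, 2, 8):
    print(f"{mult}x gradient precision -> {PARAM_BUDGET_BYTES * mult / 1e12:.0f} TB to exchange")
```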
If training is done using 2 GW of some kind of Rubin GPUs, that’s about 2e22-3e22 FP4 FLOP/s, and at 30% utilization for 4 months it produces 8e28 FP4 FLOPs. At 120 tokens/param (anchoring to 40 tokens/param for the dense Llama 3 405B and adjusting 3x for 1:8 sparsity), this system might want about 10T active params (so we get 1:16 sparsity with 160T total FP4 params, or about 1:8 for FP8). This needs 1,200T tokens, maybe 250T of them unique (so data repeated roughly 5 times), which is a problem, but not yet orders of magnitude beyond the pale, so probably something can still be done without needing bigger models.
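A rough sketch of the same arithmetic, using the standard C ≈ 6·N·D approximation for pretraining FLOPs; the power, utilization, duration, FLOP/s and tokens/param figures are the ones assumed in the paragraph above, and treating everything as FP4 FLOPs is the same simplification the text makes.

```python
import math

FLOPS_PER_SEC = 2.5e22          # ~2e22-3e22 FP4 FLOP/s from 2 GW of Rubin GPUs
UTILIZATION   = 0.30
SECONDS       = 4 * 30 * 86400  # ~4 months

C = FLOPS_PER_SEC * UTILIZATION * SECONDS
print(f"total compute ~ {C:.1e} FP4 FLOPs")            # ~8e28

TOKENS_PER_PARAM = 120  # 40 tokens/param (Llama 3 405B) x 3 for 1:8 sparsity
# C = 6 * N * D with D = TOKENS_PER_PARAM * N  =>  N = sqrt(C / (6 * TPP))
N_active = math.sqrt(C / (6 * TOKENS_PER_PARAM))
D = TOKENS_PER_PARAM * N_active
print(f"active params ~ {N_active/1e12:.0f}T")          # ~10T
print(f"tokens needed ~ {D/1e12:.0f}T")                 # ~1,200T

print(f"sparsity at 160T total params: 1:{160e12 / N_active:.0f}")  # ~1:16
```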
With large scale-up worlds, processing a sequence of 32K tokens on a non-CPX Rubin NVL144 rack at 30% utilization would take just 2.7 seconds (for pretraining). A 2 GW system has 9K such racks, so with each rack handling one sequence per step that’s a batch of 300M tokens, which is already a lot (Llama 3 405B used 16M token batches in the main phase of pretraining), so 2.7 seconds should be the target characteristic time for exchanging gradients.
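As a sketch of the step-time and batch-size arithmetic: the per-rack FP4 throughput below is my assumption, chosen to be consistent with the ~2e22-3e22 FLOP/s total spread over ~9K racks, and the 6·N·tokens cost model is the same approximation as above.

```python
N_ACTIVE    = 10e12      # active params from the previous estimate
SEQ_TOKENS  = 32 * 1024  # one 32K-token sequence per rack per step
RACK_FLOPS  = 2.4e18     # assumed FP4 FLOP/s per Rubin NVL144 rack (assumption)
UTILIZATION = 0.30

step_seconds = 6 * N_ACTIVE * SEQ_TOKENS / (RACK_FLOPS * UTILIZATION)
print(f"time per 32K sequence per rack ~ {step_seconds:.1f} s")   # ~2.7 s

RACKS = 9_000            # ~2 GW system
batch_tokens = RACKS * SEQ_TOKENS
print(f"batch ~ {batch_tokens/1e6:.0f}M tokens")                  # ~300M
```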
Moving 80 TB in 2.7 seconds needs 240 Tbps, or 500-2,000 Tbps if averaged gradients use 2x-8x more precision bits (even more if not all-to-all, which is likely with more than 2 sites), and this already loses half of utilization or asks for even larger batches. A DWDM system might transmit 30-70 Tbps over a fiber optic pair, so this is 4-70 fiber optic pairs, which seems in principle feasible to secure for overland fiber cables (which hold hundreds of pairs), especially towards the lower end of the estimate.
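And the interconnect requirement spelled out the same way; the 30-70 Tbps per DWDM fiber pair is the range quoted above, the rest follows from the 80 TB gradient volume and the 2.7 s step time.

```python
import math

GRAD_BYTES   = 80e12   # 80 TB of gradients at training precision
STEP_SECONDS = 2.7

def tbps(num_bytes, seconds):
    return num_bytes * 8 / seconds / 1e12

base = tbps(GRAD_BYTES, STEP_SECONDS)
print(f"base requirement ~ {base:.0f} Tbps")                      # ~240 Tbps

for mult in (2, 8):    # higher-precision averaged gradients
    print(f"{mult}x gradient precision -> ~{base * mult:.0f} Tbps")  # ~480-1,900 Tbps

# DWDM fiber pairs needed, at 30-70 Tbps per pair (spans roughly the 4-70 range)
for per_pair in (70, 30):
    low  = math.ceil(base / per_pair)
    high = math.ceil(base * 8 / per_pair)
    print(f"at {per_pair} Tbps/pair: {low}-{high} fiber pairs")
```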
As a naive follow-up: let’s say GPT-6 could be trained in 3 months on a 3 GW cluster. Could I instead train it in 9 months on a 1 GW cluster?