The point about Colossus is that the expensive part of a cluster can be done within a few months (even if it won’t be able to start training models immediately). The unusual thing about a training system is that it both uses the same hardware throughout and needs the newest hardware. The cost of all the other parts of a datacenter is almost a rounding error (for example, land comes out to below 1%). Altogether, this means that almost all capex for a training system falls in the short phase where you install the hardware, and you want to get all the hardware in a narrow window of time, even though the various preliminary steps can take years. For GB200s, shipments in bulk start in Q2-Q3 2025 (which also strongly suggests training only starts in 2026). I’m not sure how far the payment for hardware can be moved from the actual shipments, but all else equal for Microsoft I expect FY 2025 (July 2024 to June 2025).
NVL72 GB200s are much better than 8x H100s for inference (much more HBM, much larger scale-up world size), and can remain efficient with long context and larger models (this even weakly suggests that general deployment of larger models trained on 100K H100s will be delayed until late 2025, except for Google). So it’s unclear whether there need to be a lot of inference B200s compared to H100s before the training system also goes to inference. (Inference compute per token scales linearly with model size, while training compute scales with model size squared when training data grows in proportion to model size.)
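To make the scaling remark concrete, here is a rough sketch of the arithmetic. The constants are common rules of thumb for dense transformers and are my assumptions, not figures from the comment above: about 6 FLOPs per parameter per training token, about 2 FLOPs per parameter per generated token, and about 20 training tokens per parameter for roughly compute-optimal training. The function names are hypothetical.

```python
# Rule-of-thumb constants for dense transformers (assumptions, not exact figures).
TRAIN_FLOPS_PER_PARAM_TOKEN = 6  # forward + backward pass
INFER_FLOPS_PER_PARAM_TOKEN = 2  # forward pass only, ignoring attention cost
TOKENS_PER_PARAM = 20            # roughly compute-optimal data-to-parameter ratio


def training_flops(params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Total training compute when the dataset grows in proportion to model size."""
    tokens = tokens_per_param * params
    return TRAIN_FLOPS_PER_PARAM_TOKEN * params * tokens


def inference_flops_per_token(params: float) -> float:
    """Approximate compute needed to generate one token."""
    return INFER_FLOPS_PER_PARAM_TOKEN * params


if __name__ == "__main__":
    for params in (1e11, 2e11, 4e11):
        print(
            f"{params:.0e} params: "
            f"training {training_flops(params):.2e} FLOPs, "
            f"inference {inference_flops_per_token(params):.2e} FLOPs/token"
        )
```

Under these assumptions, doubling the parameter count doubles per-token inference compute but quadruples training compute, which is why a system sized for training a model has a lot of headroom for serving it.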