Do you know why it takes such a long time to deploy a new rack system at scale? In my mind you slap on the new Rubin chips, more HBM, and you are good to go. (In your linked comment you mention “reliability issues”; is that where the bulk of the time comes from? I did not read the linked SemiAnalysis article.) Or does everything, including e.g. cooling and interconnects, have to be redesigned from scratch for each new rack system, so you can’t reuse any of the older proven/reliable components?
That things other than chips need to be redesigned wouldn’t argue either way, because in that hypothetical everything could just come together at once, with the other things arriving the same way as the chips themselves. The issue is the capacity of factories and labor for all the components, integration, and construction. You can’t produce everything all at once; instead you need to produce each kind of thing that goes into the finished datacenters over the course of at least months, maybe as long as 2 years for sufficiently similar variants of a system that can share many steps of the process (as with H100/H200/B200 previously, and now GB200/GB300 NVL72).
How elaborate the production process needs to be also doesn’t matter: it just shifts the arrival of the finished systems in time (even if substantially), with the first systems still getting ready earlier than the bulk of them. And so the first 20% of everything (at a given stage of production) will be ready partway into the volume production period (in a broad sense that also includes construction of datacenter buildings or burn-in of racks), significantly earlier than most of it.
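To make the timing argument concrete, here is a toy sketch. It assumes output arrives at a constant rate over the volume production period, which is a simplification of mine, and the function name `ready_time`, the lead times, and the 24-month ramp are all illustrative numbers rather than estimates for any real system.

```python
def ready_time(fraction, lead_time_months, ramp_months):
    """Month at which `fraction` of all systems is finished.

    Toy model (an assumption, not a claim about real ramps): nothing is
    finished before `lead_time_months`, and after that output arrives at a
    constant rate for `ramp_months`.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return lead_time_months + fraction * ramp_months


if __name__ == "__main__":
    ramp = 24  # hypothetical volume production period, in months
    for lead in (6, 12):  # hypothetical simpler vs. more elaborate process
        first_20 = ready_time(0.2, lead, ramp)
        first_50 = ready_time(0.5, lead, ramp)
        print(f"lead={lead}: 20% at month {first_20:.1f}, "
              f"50% at month {first_50:.1f}, "
              f"gap {first_50 - first_20:.1f} months")
```

In this model a longer lead time moves both milestones later by the same amount, while the gap between the first 20% and the first half depends only on the length of the ramp.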