I don’t understand why HBM per scale-up world is a major constraint for inference. For DeepSeek-V3, “The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs.” (section 3.4.2, https://arxiv.org/pdf/2412.19437v1). This seems like evidence that you can get reasonably efficient inference out of multi-node setups.
If those are H800s (80 GB of HBM each), that minimum deployment unit has 25.6 TB of HBM in total. More generally, it seems like there are a lot of engineering tricks available for making multi-node inference efficient.
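As a rough back-of-the-envelope sketch of that figure (the GPU count is from section 3.4.2 of the paper; the 80 GB per H800 is my assumption):

```python
# Total HBM in DeepSeek-V3's minimum decoding deployment unit.
nodes = 40
gpus_per_node = 8          # 320 GPUs / 40 nodes, per section 3.4.2
hbm_per_gpu_gb = 80        # assumed HBM capacity of an H800

total_gpus = nodes * gpus_per_node
total_hbm_tb = total_gpus * hbm_per_gpu_gb / 1000

print(f"{total_gpus} GPUs x {hbm_per_gpu_gb} GB = {total_hbm_tb:.1f} TB of HBM")
# -> 320 GPUs x 80 GB = 25.6 TB of HBM
```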