Key constraints are memory for storing KV-caches, scale-up world size (a smaller collection of chips networked at much higher bandwidth than outside such collections), and the number of concurrent requests. A model needs to be spread across many chips to fit in memory, leave enough space for KV-caches, and run faster. If there aren't enough requests for inference, all these chips will be mostly idle, but the API provider will still need to pay for their time. If the model is placed on fewer chips, it can't serve many concurrent requests, since the chips would run out of memory for KV-caches, and each request also gets processed more slowly.
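To make the memory constraint concrete, here is a rough sketch of the arithmetic for a hypothetical dense transformer; all model dimensions, the FP8 cache assumption, the 32K context length, and the 400 GB weight footprint are illustrative assumptions, not any particular model's specs.

```python
# Rough KV-cache arithmetic for a hypothetical dense transformer served in FP8.
# Every number here is an illustrative assumption.

n_layers = 80          # transformer layers (assumed)
n_kv_heads = 8         # KV heads per layer, with grouped-query attention (assumed)
head_dim = 128         # dimension per head (assumed)
bytes_per_value = 1    # FP8 cache (assumed)

# Each token stores one key and one value vector per layer.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

context_len = 32_000   # tokens of context per request (assumed)
kv_gb_per_request = kv_bytes_per_token * context_len / 1e9
print(f"KV-cache per request: {kv_gb_per_request:.1f} GB")   # ~5.2 GB

# Memory left for KV-caches after weights are placed, e.g. on 8x H100 (80 GB each).
hbm_total_gb = 8 * 80
weights_gb = 400       # hypothetical large model in FP8 (assumed)
concurrent = (hbm_total_gb - weights_gb) // kv_gb_per_request
print(f"Concurrent requests that fit: {concurrent:.0f}")     # ~45
```

Under these made-up numbers, an 8-GPU Hopper node hosting the whole model has room for only a few dozen concurrent long-context requests, which is why serving cheaply requires either many such replicas kept busy or more memory per scale-up world.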
So there is a minimum number of users needed to serve a model at a given speed at low cost. GB200 NVL72 is going to change this a lot, since it ups the scale-up world size from 8 to 72 GPUs, and a B200 chip has 192 GB of HBM to H100's 80 GB (though H200s have 141 GB). This makes it possible to fit the same model on fewer chips while maintaining high speed (using fewer scale-up worlds) and processing many concurrent requests (having enough memory for many KV-caches), so inference prices for larger models will probably collapse, and Hoppers will become more useful for training experiments than for inference (other than for the smallest models). It's a greater change than between A100s and H100s, since both had 8 GPUs per scale-up world.
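The aggregate-memory comparison is just arithmetic on the per-chip figures above; a quick sketch:

```python
# Aggregate HBM per scale-up world, using the per-chip figures from the text.

h100_world = 8 * 80      # 640 GB across an 8-GPU Hopper node
h200_world = 8 * 141     # 1,128 GB
gb200_world = 72 * 192   # 13,824 GB across one NVL72 rack

print(h100_world, h200_world, gb200_world)
# A model plus its KV-caches that had to be sharded across many 8-GPU Hopper
# worlds (connected by slower inter-node links) can sit inside a single NVL72's
# NVLink domain, leaving far more room for concurrent requests' KV-caches.
```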