> If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.
Beware bandwidth bottlenecks, as I mentioned in my original post. If you have a 1TB model, you need to have it somewhere with >=1TB/s effective bandwidth between storage and the compute endpoint to achieve 1 second of latency when doing an inference. And storage capacity (not to mention model size) keeps rising faster than bandwidth does...
(There are tricks here to an extent—such as compressing the model and decompressing it on-target—but they seldom save much. (And if they do, that just means your model is inefficient...))
According to a random guy on the internet, GPT-3 is ~300GB compressed. PCIe gen4x16 is ~31.5GB/s. With a 1-second latency budget, that means you can only stream in ~31.5GB per card (on top of whatever is already resident in each card's memory).
That being said, as far as I can tell it is—in theory—possible to run a GPT-3 inference on a single Threadripper Pro platform (or something else with 128 lanes of gen4 pcie), with 8x 6GB graphics cards in 1 second, if you have 300GB of DRAM lying around. (Or 4x 12GB graphics cards in 2 seconds, with the other half of the pcie lanes filled with gen4 SSDs.)
(In practice I strongly suspect you’ll hit some unknown limit in the PCIe root complex or thereabouts. This is shuffling something silly like 250GB/s of data through that one poor root complex.)
(It’s a pity that there’s no good way to ask a GPU to pull data directly from an SSD. ICMB could help, but it requires GPU-side software support. Most of this data stream could go directly from SSD to PCIe switch to graphics card without having to be bounced through the root port...)
(Yes, 8x gpu->gpu communications will hurt overall latency… but not by all that much I don’t think. 1 second is an eternity.)
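For anyone who wants to check the arithmetic, here's a back-of-envelope sketch of that 8-card setup (same assumptions as above: ~300GB of compressed weights, ~31.5GB/s per gen4 x16 slot, 8 cards with 6GB of memory each; it ignores compute time and the gpu->gpu hops entirely):

```python
# Rough streaming-bandwidth arithmetic only -- not a benchmark.
MODEL_BYTES = 300e9        # compressed GPT-3 weights (rough figure from above)
PCIE4_X16_BPS = 31.5e9     # per-card PCIe gen4 x16 bandwidth, bytes/s
NUM_CARDS = 8
VRAM_PER_CARD = 6e9        # bytes already resident on each card

resident = NUM_CARDS * VRAM_PER_CARD               # 48 GB held in GPU memory
to_stream = MODEL_BYTES - resident                 # 252 GB pulled over PCIe
latency = to_stream / (NUM_CARDS * PCIE4_X16_BPS)  # all cards stream in parallel

print(f"~{to_stream / 1e9:.0f} GB streamed -> ~{latency:.2f} s per inference")  # ~1.00 s
```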
> As I think we both agree, pipelining, in the sense of using different GPUs to compute different layers, doesn’t reduce latency.
Indeed. If anything it increases latency, since you're adding GPU-->GPU trips to the critical path.
> Beware bandwidth bottlenecks, as I mentioned in my original post.
Presumably bandwidth requirements can be reduced a lot through width-wise parallelism. Each GPU then only has to load one slice of the model. Of course you'll need more GPUs, but still not a crazy number as long as you use something like ZeRO-Infinity.
> (Yes, 8x gpu->gpu communications will hurt overall latency… but not by all that much I don’t think. 1 second is an eternity.)
Width-wise communication, if you mean that, can be quite a latency bottleneck for training. And it gets worse when you make the model wider or the batch bigger, which of course people are constantly doing. But for inference I guess you can reduce the latency if you’re willing to use a small batch size.
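To make the width-wise picture concrete, here's a toy sketch in plain NumPy (the shapes and the 8-way split are made up for illustration): each "GPU" holds only a column slice of a layer's weights, so it loads 1/8 of the parameters, and the concatenation at the end stands in for the per-layer all-gather that is the width-wise communication being discussed:

```python
import numpy as np

def width_parallel_layer(x, weight_shards):
    # Each "device" does its local matmul against its own slice of the weights...
    partials = [x @ w for w in weight_shards]
    # ...then the partial outputs are gathered. In a real system this concatenate
    # is an all-gather over the interconnect -- the width-wise communication step.
    return np.concatenate(partials, axis=-1)

d_model, n_devices = 1024, 8
full_weight = np.random.randn(d_model, 4 * d_model)
shards = np.split(full_weight, n_devices, axis=1)  # each device loads 1/8 of the layer

x = np.random.randn(1, d_model)                    # batch size 1, as for low-latency inference
assert np.allclose(width_parallel_layer(x, shards), x @ full_weight)
```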
> Presumably bandwidth requirements can be reduced a lot through width-wise parallelism.
Total PCIe bandwidth for even a Threadripper Pro platform (128 lanes of gen4 pcie) is ~250GB/s. Most other platforms have less (especially Intel, which likes to market-segment by restricting the number of pcie lanes).
Gen5 and gen6 PCIe in theory will double this and double this again—but on a multiyear cadence at best.
Meanwhile GPT-3 is ~300GB compressed, and model size seems to keep increasing.

Hence: beware bandwidth bottlenecks.
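To put rough numbers on those doublings (assuming ~1.97GB/s per gen4 lane, i.e. ~31.5GB/s per x16 slot, a clean doubling each generation, and the whole ~300GB model streamed over PCIe):

```python
# Assumed round numbers; real platforms will land somewhat below these.
MODEL_GB = 300
LANES = 128
per_lane_gb_s = {"gen4": 1.97, "gen5": 3.94, "gen6": 7.88}  # GB/s per lane

for gen, bw in per_lane_gb_s.items():
    total = LANES * bw
    print(f"{gen}: ~{total:.0f} GB/s total -> ~{MODEL_GB / total:.2f} s to stream the model")
# gen4: ~252 GB/s -> ~1.19 s; gen5: ~504 GB/s -> ~0.59 s; gen6: ~1009 GB/s -> ~0.30 s
```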
My point is that, while PCIe bandwidths aren’t increasing very quickly, it’s easy to increase the number of machines you use. So you can distribute each NN layer (width-wise) across many machines, each of which adds to the total bandwidth you have.
(As noted in the previous comment, you can do this with <<300GB of total GPU memory for GPT-3 with something like ZeRO-Infinity.)
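A toy sizing calculation of that scaling argument (the 1TB model from the top of the thread, an assumed ~250GB/s of streaming bandwidth per machine, and inter-machine network overhead ignored, which is the optimistic part):

```python
import math

MODEL_GB = 1000           # hypothetical 1 TB model
PER_MACHINE_GB_S = 250    # ~128 lanes of gen4 PCIe per machine
TARGET_LATENCY_S = 1.0

# Each machine holds a width-wise slice of every layer, so aggregate streaming
# bandwidth scales roughly linearly with the number of machines.
machines = math.ceil(MODEL_GB / (PER_MACHINE_GB_S * TARGET_LATENCY_S))
print(f"~{machines} machines to stream {MODEL_GB} GB within {TARGET_LATENCY_S} s")  # -> 4
```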
I'm glad we were able to work this out!