Thanks! I just read over it and, assuming I understood correctly, this bottleneck primarily happens for “small” operations like layer normalization and softmax, and not for large matrix multiplies. In addition, these small operations are still the minority of runtime (40% in their case). So I think this is still consistent with my analysis, which assumes various things will creep in to keep GPU utilization around 40%, but that they won’t ever drive it to (say) 10%. Is this correct, or have I misunderstood the nature of the bottleneck?
Edit: also, maybe we’re just miscommunicating. I definitely don’t think CPU->HBM is a bottleneck; it’s instead the time to load from HBM, which sounds the same as what you said. Unless I misread the A100 specs, that comes out to 1.5 TB/s, which is the number I use throughout.
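To make the distinction concrete, here is a rough roofline-style sketch of why I'd expect small elementwise ops to be limited by HBM bandwidth while large matrix multiplies are not. The 1.5 TB/s figure is the one I quoted above; the 312 TFLOP/s peak (A100 dense BF16), the 2-byte element size, the 4096-wide shapes, and the "8 FLOPs per element" cost for a layer-norm-like op are my own illustrative assumptions, and all the function names are made up for this example.

```python
# Back-of-the-envelope roofline check (a sketch, not a benchmark).
HBM_BYTES_PER_S = 1.5e12   # HBM bandwidth quoted in the comment above
PEAK_FLOPS = 312e12        # ASSUMPTION: A100 dense BF16 tensor-core peak
BYTES_PER_ELEM = 2         # ASSUMPTION: bf16/fp16 storage


def ridge_point():
    """FLOPs per byte an op needs before compute, not HBM, is the limit."""
    return PEAK_FLOPS / HBM_BYTES_PER_S


def matmul_intensity(m, k, n):
    """FLOPs per HBM byte for an (m x k) @ (k x n) matmul, each matrix touched once."""
    flops = 2 * m * k * n
    bytes_moved = BYTES_PER_ELEM * (m * k + k * n + m * n)
    return flops / bytes_moved


def elementwise_intensity(flops_per_elem):
    """FLOPs per HBM byte for an op that reads and writes each element once."""
    return flops_per_elem / (2 * BYTES_PER_ELEM)


def is_memory_bound(intensity):
    return intensity < ridge_point()


if __name__ == "__main__":
    big_matmul = matmul_intensity(4096, 4096, 4096)   # hypothetical shape
    small_op = elementwise_intensity(8)               # hypothetical layer-norm-like cost
    print("ridge point (FLOPs/byte):", ridge_point())
    print("matmul memory-bound?", is_memory_bound(big_matmul))
    print("elementwise memory-bound?", is_memory_bound(small_op))
```

Under these assumptions the large matmul sits well above the ridge point and the elementwise op sits far below it, which matches my reading that the bandwidth bottleneck bites mainly on the small operations.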