If the pretraining system (built in 2027) is about 2 GW, that’s 5K Rubin NVL144 CPX racks, or 8e28 FP4 FLOPs[1] in 4 months at 30% utilization. At 120 tokens/param, this is enough for 10T active params in a compute-optimal MoE model. With 150 layers, 8 active experts per layer, and a GLU nonlinearity (3 matrices per FFN block), this gives 50Kx50K matrices. Such transformers would be too large to efficiently generate output tokens on Rubin NVL144 (even in FP4), but they might be analogous to GPT-4.5 in that the immediately following hardware, Rubin Ultra NVL576, can generate output tokens for them efficiently. In any case, 5T active params and 20T total seems OK for Rubin NVL144 to generate output tokens (10 TB of HBM out of the 20 TB a rack will have), which gives 37Kx37K matrices.
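The headline numbers above can be sanity-checked with quick arithmetic. This is only a sketch: the 5.3e18 FLOP/s per-rack figure is the SemiAnalysis estimate discussed in the footnote, and the standard compute ≈ 6 · params · tokens rule for training FLOPs is assumed.

```python
# Sanity check of the compute and model-size arithmetic (inputs from the text).
racks = 5_000                  # Rubin NVL144 CPX racks at ~2 GW
rack_flops = 5.3e18            # FP4 FLOP/s per rack (SemiAnalysis estimate)
seconds = 4 * 30 * 86_400      # ~4 months
utilization = 0.30

total = racks * rack_flops * seconds * utilization
print(f"total compute: {total:.1e} FLOPs")      # ~8e28

# Compute-optimal sizing at 120 tokens per active param,
# using compute ~ 6 * N_active * tokens = 720 * N_active^2.
n_active = (total / 720) ** 0.5
print(f"active params: {n_active:.1e}")         # ~1e13, i.e. ~10T

# Matrix side with 150 layers * 8 active experts * 3 matrices per FFN block.
side = (n_active / (150 * 8 * 3)) ** 0.5
print(f"matrix side: {side:,.0f}")              # ~50K ballpark
```

The same script with 5T active params in the last step reproduces the ~37K matrix side quoted for the NVL144-servable model.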
A Rubin CPX compute die produces 20e15 FP4 FLOP/s[2]. Multiplying square matrices with side N takes 2N³ FLOPs and exchanges 3N²/2 bytes with memory (three FP4 matrices of N² entries). At 2 TB/s GDDR7 bandwidth, this needs N of at least 7500. Processing an FFN block of 3 square matrices with side N (on a batch of N tokens) takes 6N³ FLOPs and exchanges 2N²/2 bytes on the network in both directions in total (FP4 activations in and out). At 0.2 TB/s CX-9 bidirectional bandwidth, this needs N of at least 17K. So there’s even enough margin for an off-by-2x mistake in these estimates, for various matrices actually having non-square shapes, or for models being somewhat smaller.
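The two thresholds fall out of matching arithmetic intensity (FLOPs per byte moved) to the die’s compute-to-bandwidth ratio, with the throughput and bandwidth figures as given above:

```python
# Minimum matrix side N before a Rubin CPX die becomes bandwidth-bound
# rather than compute-bound (figures from the text).
die_flops = 20e15     # FP4 FLOP/s
mem_bw = 2e12         # GDDR7, bytes/s
net_bw = 0.2e12       # CX-9 bidirectional, bytes/s

# Square matmul: 2N^3 FLOPs vs 3N^2/2 bytes of FP4 memory traffic,
# i.e. (4/3)N FLOPs per byte; match to the die's 1e4 FLOPs per byte.
n_mem = (die_flops / mem_bw) * 3 / 4
print(f"memory-bound below N = {n_mem:.0f}")      # 7500

# FFN block on N tokens: 6N^3 FLOPs vs 2N^2/2 bytes of network traffic,
# i.e. 6N FLOPs per byte; match to 1e5 FLOPs per byte over the network.
n_net = (die_flops / net_bw) / 6
print(f"network-bound below N = {n_net:.0f}")     # ~17K
```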
The SemiAnalysis estimate of 5.3e18 FLOP/s per Rubin NVL144 CPX rack is indeed based on a different sparse-to-dense compute ratio: they claim it’s 3:2 for Rubin. I haven’t yet searched for a source for this, but in any case it’s in the article and I missed it on first reading, so I didn’t recall it when my own estimate based on the 2:1 sparse-to-dense ratio failed to match theirs.
As in the previous footnote, this is what the announced 30e15 FP4 FLOP/s becomes after applying the 3:2 sparse-to-dense compute ratio rather than the 2:1 ratio.
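The conversion in this footnote amounts to a one-line check:

```python
announced_sparse = 30e15            # announced (sparse) FP4 FLOP/s per CPX die
dense_at_3_to_2 = announced_sparse * 2 / 3
print(f"{dense_at_3_to_2:.0e}")     # 2e+16, i.e. the 20e15 used above
```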
Thoughts on whether the >10x lower chip-to-chip interconnect from the CPX chips (PCIe 6.0 x16's 128 GB/s unidirectional vs. NVLink 5's 1.8 TB/s bidirectional) will be a bottleneck blocking them from being that useful in pre-training?