By 2027-2028, pretraining compute might get an unexpected ~4x boost in price-performance above trend. Nvidia Rubin NVL144 CPX will double the number of compute dies per rack compared to the previously announced Rubin NVL144, and a May 2025 paper demonstrates that pretraining with Nvidia's NVFP4 4-bit block number format can reach loss parity with BF16.
The additional chips[1] in the NVL144 CPX racks don't introduce any overhead to the scale-up networking of the non-CPX chips (they mostly just increase the power consumption), and they don't include HBM, so in principle they are an extremely cost-effective increase in the amount of compute (if it can find high utilization). They aren't useful for decoding/generation (output tokens), but they can be useful for pretraining (as well as for their declared purpose of prefill, input token processing during inference). Not being included in a big scale-up world could in principle be a problem early in a large pretraining run, because it forces larger batch sizes, but high-granularity MoE (where many experts are active) can counteract that, and even merely coming into play a bit later in a pretraining run, once larger batch sizes are less of a problem, might be impactful enough.
Previously only FP8 looked plausible as a pretraining number format, but the new paper describes a better block number format and a pretraining process that plausibly solve the major issues with using FP4. NVFP4 uses a proper FP8 number (rather than a pure exponent, a power of 2) as the scaling factor that multiplies the 4-bit numbers within a block, and the number blocks are organized as small squares rather than as segments of rows of the matrix. The pretraining method adds a new kind of "cooldown" phase where training is finished in BF16, after using NVFP4 for most of the run. This proves sufficient to arrive at the same loss as pure BF16 pretraining (Figure 6b). Using this to scale the largest attempted training run seems risky, but in any case the potential to make use of this boost in price-performance at some point, even if a bit later, won't be going away.
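To make the format concrete, here is a minimal sketch of this kind of block quantization in Python/NumPy. The E2M1 value grid is the standard set of FP4 magnitudes; the 16x16 block size, the crude FP8-like rounding of the scale, and nearest-value rounding are assumptions for illustration, not the exact NVFP4 spec or the paper's training recipe.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float (plus sign).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp8_e4m3(x):
    """Crude stand-in for rounding a positive scale to FP8 E4M3-like precision:
    keep 3 mantissa bits after the leading one (exponent range/clamping ignored)."""
    if x == 0.0:
        return 0.0
    e = np.floor(np.log2(x))
    m = np.round(x / 2**e * 8) / 8
    return m * 2**e

def quantize_block_fp4(block):
    """Quantize one block: pick a scale mapping the max |value| to 6 (top of the
    E2M1 grid), round the scale to FP8-like precision, snap values to the grid."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block), 1.0
    scale = quantize_fp8_e4m3(amax / 6.0)
    scaled = block / scale
    idx = np.abs(scaled[..., None] - np.sign(scaled)[..., None] * E2M1_GRID).argmin(-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, scale  # a real kernel would store q as 4-bit codes and scale as FP8

def quantize_matrix(w, b=16):
    """Apply block quantization over b-by-b square tiles, return the dequantized matrix."""
    out = np.empty_like(w)
    for i in range(0, w.shape[0], b):
        for j in range(0, w.shape[1], b):
            q, s = quantize_block_fp4(w[i:i+b, j:j+b])
            out[i:i+b, j:j+b] = q * s
    return out

w = np.random.randn(64, 64).astype(np.float32)
w_q = quantize_matrix(w)
print("relative error:", np.linalg.norm(w - w_q) / np.linalg.norm(w))
```

The point of the FP8 (rather than power-of-2) scale and the small square blocks is that each block's scale tracks the local magnitude of the weights more tightly, so the 4-bit values waste less of their tiny dynamic range.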
If pretraining had to remain in BF16, the on-trend improvement with Rubin (over GB200), which moves to a 3nm process, might've been about 2x per reticle-sized compute die. But there was already an impactful change: the scale-up networking part of the Blackwell compute dies was extracted into specialized IO chiplets in Rubin, freeing up area on the compute dies for the actual compute, potentially affecting all precisions. In GB200, FP4 performance is 2x the FP8 performance, which is in turn 2x the BF16 performance. But in GB300, FP4 performance improves by 1.5x over GB200 (from 10e15 FLOP/s per chip/package to 15e15 FLOP/s), likely by cannibalizing other things for FP4. And FP8 in Rubin improves over FP8 of GB200 by 3.3x (from 5e15 FLOP/s per chip/package to 17e15 FLOP/s), while "inference FP4" is claimed to be 50e15 FLOP/s per chip/package. That figure is likely the never-useful sparse compute performance, in contrast to the actually-useful but not explicitly announced dense "training FP4", which has always been 2x lower before. So the actual FP4 performance relevant for NVFP4 pretraining is probably 25e15 FLOP/s per chip/package, 2.5x more than for GB200 and 1.5x more than for GB300.
The Rubin NVL144 CPX announcement presentation includes some details suggesting slightly more performance than that. A Rubin CPX compute die is claimed to have 30e15 FP4 FLOP/s (at 21:31 in the video). Anchoring to the above estimate of 25e15 FLOP/s per package with 2 compute dies, this must be the sparse compute performance, so the dense performance would likely be 15e15 FLOP/s per compute die, about 20% higher than for the non-CPX compute dies. For the whole rack, this gives 4e18 FLOP/s, 5.5x more than the 720e15 FP4 FLOP/s of GB200 NVL72. This is partially corroborated by the explicit claim that the total NVFP4 performance of a Rubin NVL144 CPX rack is 8e18 FLOP/s (at 24:28 in the video), which I'm interpreting as sparse compute performance, probably 2x the more relevant dense performance. (The SemiAnalysis estimate is 5.3e18 dense FP4 FLOP/s for some reason; perhaps they know that the difference between sparse and dense is not 2x for Rubin.)
So the total increase in dense FP4 performance potentially relevant for pretraining, using Rubin NVL144 CPX over FP8 on GB200 NVL72, might be about 11x (72x 5e15 FP8 FLOP/s for GB200, which is 0.36e18 FLOP/s, becomes 72x 25e15 FP4 FLOP/s for the non-CPX Rubin chips plus 144x 15e15 FP4 FLOP/s for the Rubin CPX chips, which is 4e18 FLOP/s in total). The racks are still Oberon (72 non-CPX chips/packages in a rack-sized scale-up world of the same size, with the same number of chips included in it), so the cost might only change slightly, maybe 1.5x (there are still 2x more compute dies). That's 3.7x more price-performance than the ~2x that the mere change in semi process would predict (Moore's law of price-performance). (Or 4.9x if we follow the SemiAnalysis estimate of dense 5.3e18 FP4 FLOP/s for a Rubin NVL144 CPX rack.)
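To make that arithmetic explicit, here is the rack-level calculation as a short Python snippet. The per-package and per-die dense FP4 figures, the ~1.5x cost factor, and the ~2x process trend are the rough estimates from this comment, not announced numbers.

```python
# Rack-level dense FLOP/s and price-performance, using the estimates above.
gb200_fp8_per_package = 5e15     # dense FP8 FLOP/s, GB200 package
rubin_fp4_per_package = 25e15    # estimated dense FP4 FLOP/s, non-CPX Rubin package (2 dies)
rubin_cpx_fp4_per_die = 15e15    # estimated dense FP4 FLOP/s, Rubin CPX die

gb200_nvl72_rack = 72 * gb200_fp8_per_package                              # 0.36e18 FLOP/s
rubin_cpx_rack = 72 * rubin_fp4_per_package + 144 * rubin_cpx_fp4_per_die  # ~4e18 FLOP/s

perf_ratio = rubin_cpx_rack / gb200_nvl72_rack  # ~11x
cost_ratio = 1.5                                # rough guess: 2x compute dies, same Oberon rack
trend = 2.0                                     # ~2x expected from the process node change alone
above_trend = perf_ratio / cost_ratio / trend   # ~3.7x

print(f"rack dense FLOP/s: {rubin_cpx_rack:.2e}, perf ratio: {perf_ratio:.1f}x, "
      f"price-performance above trend: {above_trend:.1f}x")
```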
A GB200 NVL72 rack has 72 chips/packages, each with 2 compute dies. Rubin NVL144 CPX has 72 non-CPX chips/packages, each with 2 compute dies, and an additional 144 CPX chips, each with 1 compute die, for a total of 288 compute dies of both kinds, 2x more than the 144 compute dies in a GB200 NVL72 rack.
In general, publicly known training techniques are behind SOTA, so this should be taken into account.
Thoughts on whether the >10x lower chip-to-chip interconnect from the CPX chips (PCIe 6.0 x16's 128 GB/s unidirectional vs. NVLink 5's 1.8 TB/s bidirectional) will be a bottleneck blocking them from being that useful in pre-training?
If the pretraining system (built in 2027) is about 2 GW, that's 5K Rubin NVL144 CPX racks, or 8e28 FP4 FLOPs[1] in 4 months at 30% utilization. At 120 tokens/param, this is enough for 10T active params in a compute-optimal MoE model. With 150 layers, 8 active experts per layer, and a GLU nonlinearity (3 matrices per FFN block), this gives 50Kx50K matrices. Such transformers would be too large to efficiently generate output tokens on Rubin NVL144 (even in FP4), but they might be analogous to GPT-4.5 in that the immediately following hardware, Rubin Ultra NVL576, can efficiently generate output tokens for them. In any case, 5T active params and 20T total params seems OK for Rubin NVL144 to generate output tokens (10 TB of HBM out of the 20 TB a rack will have), which gives 37Kx37K matrices.
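For reference, the chain from compute budget to matrix dimensions as a short Python sketch; the per-rack FLOP/s figure, utilization, tokens/param ratio, and layer/expert counts are the assumptions stated above (and in footnote [1]), and attention parameters are ignored.

```python
# Compute budget -> model size -> matrix dimensions, under the assumptions in the text.
racks = 5_000
flops_per_rack = 5.3e18          # dense FP4 FLOP/s per Rubin NVL144 CPX rack (footnote [1])
utilization = 0.30
seconds = 4 * 30 * 24 * 3600     # ~4 months

total_flops = racks * flops_per_rack * utilization * seconds  # ~8e28 FLOPs

tokens_per_param = 120
# Chinchilla-style accounting: total_flops ~= 6 * N_active * D = 6 * N_active * (120 * N_active)
n_active = (total_flops / (6 * tokens_per_param)) ** 0.5      # ~1e13; the text rounds to 10T

layers, experts, mats_per_ffn = 150, 8, 3
# Ignoring attention params: N_active ~= layers * experts * mats_per_ffn * d**2
d = (n_active / (layers * experts * mats_per_ffn)) ** 0.5     # ~50K after rounding

print(f"total FLOPs ~{total_flops:.1e}, active params ~{n_active:.1e}, matrix side ~{d:,.0f}")
```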
A Rubin CPX compute die produces 20e15 FP4 FLOP/s[2]. For multiplying square matrices with side N, it needs 2N³ FLOPs and has to exchange 3N²/2 bytes with memory (3 matrices of N² FP4 values). At 2 TB/s of GDDR7 bandwidth, this needs N to be at least 7500. For processing a batch of N tokens through an FFN block of 3 square matrices with side N, it needs 6N³ FLOPs and has to exchange 2N²/2 bytes on the network in both directions in total. At 0.2 TB/s of CX-9 bidirectional bandwidth, this needs N to be at least 17K. So there's even enough margin for an off-by-2x mistake in these estimates, for various matrices actually getting non-square shapes, or for models being somewhat smaller.
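And a minimal check of those two break-even points, using the figures assumed above (20e15 dense FP4 FLOP/s per CPX die, 2 TB/s GDDR7, 0.2 TB/s CX-9 bidirectional, FP4 = 0.5 bytes per value), which are this comment's estimates rather than official specs.

```python
# Roofline-style break-even points for a Rubin CPX die, under the stated assumptions.
flops_per_s = 20e15
hbm_bw = 2e12        # GDDR7 bytes/s (CPX chips have no HBM)
net_bw = 0.2e12      # CX-9 bidirectional bytes/s
bytes_per_val = 0.5  # FP4

# N x N matmul: compute-bound when 2*N**3 / flops_per_s >= 3*N**2*bytes_per_val / hbm_bw
n_min_memory = 3 * bytes_per_val * flops_per_s / (2 * hbm_bw)
# FFN block on a batch of N tokens: 6*N**3 FLOPs vs 2*N**2*bytes_per_val bytes on the network
n_min_network = 2 * bytes_per_val * flops_per_s / (6 * net_bw)

print(f"N for memory-bandwidth break-even: ~{n_min_memory:,.0f}")    # ~7,500
print(f"N for network-bandwidth break-even: ~{n_min_network:,.0f}")  # ~17,000
```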
The SemiAnalysis estimate of 5.3e18 FLOP/s per Rubin NVL144 CPX rack is indeed based on a different ratio of sparse to dense compute: they claim it's 3:2 for Rubin. I haven't yet searched for a source for this, but in any case it is in the article and I missed it on first reading, so I didn't recall it when my own estimate based on the 2:1 sparse-to-dense ratio failed to match theirs.
As in the previous footnote, this is what the announced 30e15 FP4 FLOP/s becomes after applying the 3:2 sparse-to-dense compute ratio rather than the 2:1 ratio.