In MoE, each expert only consumes a portion of the tokens, maybe 8x-32x fewer than there are tokens in total. When decoding, each sequence only contributes 1 token without speculative decoding, or maybe 8 tokens with it (though the incorrectly speculated tokens later get thrown away). When you multiply two NxN square matrices of 16-bit numbers, you need to read 4N² bytes from HBM, perform 2N³ FLOPs, and write back 2N² bytes of the result, which means you are performing N/3 BF16 FLOPs per byte read from or written to HBM. For an H200, HBM bandwidth is 4.8 TB/s, while BF16 compute is 1e15 FLOP/s. So to feed the compute with enough data, you need N to be at least about 600, probably 1K in practice.
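To make the arithmetic explicit, here is the same estimate as a few lines of Python; only the figures already stated above go into it (a back-of-envelope sketch, not a kernel model):

```python
# Arithmetic intensity of a square matmul vs. what an H200 needs to stay compute-bound.
HBM_BW = 4.8e12      # H200 HBM bandwidth, bytes/s
BF16_FLOPS = 1e15    # H200 BF16 compute, FLOP/s (rounded)

def matmul_intensity(n: int) -> float:
    """BF16 FLOPs per byte of HBM traffic for an NxN @ NxN matmul in 16-bit."""
    flops = 2 * n**3                    # multiply-accumulates
    bytes_moved = 4 * n**2 + 2 * n**2   # read two input matrices, write one output, 2 bytes each
    return flops / bytes_moved          # simplifies to n / 3

# To be compute-bound we need (n / 3) * HBM_BW >= BF16_FLOPS, i.e. n >= 3 * BF16_FLOPS / HBM_BW.
n_min = 3 * BF16_FLOPS / HBM_BW
print(f"intensity at N=1024: {matmul_intensity(1024):.0f} FLOPs/byte")
print(f"minimum N to keep the compute fed: ~{n_min:.0f}")  # ~625, hence "at least 600"
```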
So to feed the compute in a MoE model with 1:8-1:32 sparsity, you need to be processing 8K-32K tokens at a time. This isn't too much of a problem for pretraining or prefill, since you work with all tokens of all the sequences in a batch simultaneously. But for decoding, you only have 1-8 tokens per sequence (at its end, currently being generated), which means 1K-4K sequences (with speculative decoding) or the full 8K-32K sequences (without) whose tokens could arrive at a given expert located on a particular physical chip. Each sequence might need 10 GB of KV cache, for a total of 10-40 TB or 80-320 TB. An 8-chip H200 node has 1.1 TB of HBM, so that's a lot of nodes, and the activation vectors would need to travel between them to find their experts, roughly one hop per node, the same 10-40 or 80-320 hops, per either 4 tokens or 1 token of progress (though the hop count tops out at the number of layers, say 80 layers). Each hop between nodes might need to communicate 50 KB of activation vectors per token, which is 0.4-1.6 GB in total (the same with and without speculative decoding), let's say 1 GB. With 8x400 Gbps of network bandwidth per node, it'd take 2.5 ms to transmit (optimistically). With speculative decoding, that's 25-100 ms per decoding step (over 10-40 hops), and without it 200 ms per token (over 80 hops). Over 50K token long sequences, this takes 0.1-0.3 hours with speculative decoding (assuming 4 tokens are guessed correctly on average at each step of decoding), or about 3 hours without. At 40% utilization with 250B active params, with each chip computing over its share of 12-14 sequences, the compute itself would take about 0.4 hours with speculative decoding (including the discarded tokens), and 0.2 hours without. So maybe there is still something to this in principle, with speculative decoding and a more frugal KV cache. But this is probably not what will be done in practice. Also, there are only 4.3K steps of 0.5 hours each in 3 months (for RLVR), which is plausibly too few.
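This chain of estimates is easier to sanity-check as explicit arithmetic. A rough Python sketch using the same round numbers as above (10 GB of KV cache per sequence, ~1 GB of activations per hop at 8x400 Gbps, at most 80 hops, 50K-token sequences, 250B active params at 40% utilization); it keeps exact fractions, so it lands near but not exactly on the rounded figures in the text:

```python
# Decode-over-many-nodes estimate, transcribing the paragraph above.
GB, TB = 1e9, 1e12

def decode_estimate(n_seqs, drafted_per_step, accepted_per_step):
    nodes = n_seqs * 10 * GB / (1.1 * TB)   # nodes needed just to hold the KV cache
    hops = min(nodes, 80)                   # at most one inter-node hop per layer
    hop_s = 1 * GB / (400 * GB)             # ~1 GB of activations at 8x400 Gbps (~400 GB/s): 2.5 ms
    step_s = hops * hop_s                   # communication time per decoding step
    comm_h = (50_000 / accepted_per_step) * step_s / 3600
    # Compute side: each chip works through its share of the sequences,
    # including the drafted-but-discarded tokens when speculating.
    seqs_per_chip = n_seqs / (nodes * 8)
    tokens_processed = 50_000 * drafted_per_step / accepted_per_step
    compute_h = seqs_per_chip * 2 * 250e9 * tokens_processed / (0.4 * 1e15) / 3600
    return round(nodes), round(step_s, 3), round(comm_h, 2), round(compute_h, 2)

print(decode_estimate(1_000, 8, 4))  # 1:8 sparsity, spec. decoding: ~9 nodes, 23 ms/step, ~0.1 h comm, ~0.5 h compute
print(decode_estimate(4_000, 8, 4))  # 1:32 sparsity, spec. decoding: ~36 nodes, 91 ms/step, ~0.3 h comm
print(decode_estimate(8_000, 1, 1))  # 1:8 sparsity, no spec. decoding: ~73 nodes, ~0.18 s/token, ~2.5 h comm, ~0.2 h compute
```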
Instead, we don't worry about feeding the compute on each chip during decoding, and multiply thin matrices with very few tokens at each expert, which means all the time is spent moving data from HBM. With a 2T total param FP8 model we'd need maybe 4 nodes to fit it, with enough HBM left over for about 200 sequences of KV cache. Passing all of a node's HBM through its chips takes about 25-30 ms, so with the model spread over 4 nodes that largely work in turn within a layer-by-layer pass, a decoding step takes on the order of 100 ms, which translates to 10 tokens per second per sequence without speculative decoding, or 40 tokens per second with it (which comes out to $2-9 per 1M tokens at $2 per H200-hour). At 40 tokens per second over 200 sequences, a 250B active parameter model uses about 4e15 FLOP/s, while the 4 nodes of 8 chips each could supply 32e15 BF16 FLOP/s at full compute utilization, so we get about 12% compute utilization with speculative decoding, 3x-4x lower than the proverbial 40% of pretraining. Though it's still 0.3 hours per RLVR step, of which there are only 6.4K in 3 months. So there could be reason to leave HBMs half-empty and thereby make decoding 2x faster.
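A similar sketch for this HBM-bound regime; the one assumption added beyond the text is that the 4 nodes mostly work through a decoding step one after another, which is what stretches the ~29 ms per-node HBM pass into a step of roughly 100-115 ms (and leaves the results a little under the rounded 10/40 tokens per second above):

```python
# HBM-bandwidth-bound decoding: 2T-param FP8 model on 4 H200 nodes, 200 sequences.
GB, TB = 1e9, 1e12

nodes, chips_per_node = 4, 8
chips = nodes * chips_per_node
hbm_per_node = 1.1 * TB
hbm_bw = 4.8 * TB                                 # per-chip HBM bandwidth, bytes/s

pass_per_node_s = hbm_per_node / (chips_per_node * hbm_bw)  # ~29 ms to stream one node's HBM
step_s = nodes * pass_per_node_s                  # assumes nodes work largely in turn: ~115 ms
tok_s_plain = 1 / step_s                          # ~9 tokens/s/sequence without spec. decoding
tok_s_spec = 4 / step_s                           # ~35 tokens/s/sequence with ~4 accepted per step

seqs = 200
cost_per_hour = chips * 2.0                       # $2 per H200-hour
for tok_s in (tok_s_plain, tok_s_spec):
    tokens_per_hour = tok_s * seqs * 3600
    print(f"{tok_s:.0f} tok/s/seq, ${1e6 * cost_per_hour / tokens_per_hour:.1f} per 1M tokens")

# Utilization with speculative decoding, counting only accepted tokens.
flops_used = tok_s_spec * seqs * 2 * 250e9        # ~3.5e15 FLOP/s
flops_avail = chips * 1e15                        # 32e15 BF16 FLOP/s at full utilization
print(f"compute utilization: {flops_used / flops_avail:.0%}")  # ~11%, vs ~12% with rounder numbers
```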