Vladimir_Nesov comments on Vladimir_Nesov’s Shortform

Vladimir_Nesov 18 May 2026 16:07 UTC
5 points
Suppose you have a lot of compute, but 25x less unique data than would be compute optimal to use in pretraining. A May 2026 paper takes some measurements suggesting that the best loss is achieved by using a 5x bigger model (than would be compute optimal) and training it for 5 epochs of repeated data (see Figure 5, left and Table 2).

The measurements are taken at around 1e19 FLOPs of compute, so they are not very convincing about what happens around the more relevant 5e29 FLOPs. But training with repetition for about 5 epochs is a familiar anchor when data is scarce, so it seems reasonable. The new thing is that this suggests scaling the model size proportionally to the number of repetitions of unique data, with both factors equal to the square root of the data shortfall, which keeps total training compute unchanged. The paper doesn’t give experimental data for this, but perhaps if there’s only 4x less unique data than would be compute optimal, the thing to do is to use a 2x bigger model and train for 2 epochs.
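
The rule of thumb above can be written down as a tiny sketch. The function names are hypothetical, and the square-root form is an extrapolation from the two cases mentioned (25x and 4x less data); the paper gives no experimental data for the 4x case. Compute is taken as proportional to params times tokens processed, which is an assumption of this sketch.

```python
import math

def data_constrained_recipe(data_deficit):
    """Return (model_scale, epochs) for a shortfall of unique data.

    data_deficit: how many times less unique data is available than the
    compute optimal token count (25 in the main example).
    Assumption: both factors equal sqrt(data_deficit).
    """
    k = math.sqrt(data_deficit)
    return k, k

def relative_compute(data_deficit, model_scale, epochs):
    """Training compute relative to the compute optimal run, assuming
    compute is proportional to params * tokens processed."""
    tokens_scale = epochs / data_deficit  # tokens processed vs. compute optimal
    return model_scale * tokens_scale

for deficit in (25, 4):
    scale, epochs = data_constrained_recipe(deficit)
    print(deficit, scale, epochs, relative_compute(deficit, scale, epochs))
```

A model that is k times bigger, trained on k epochs of 1/k² as much data, processes 1/k as many tokens, so the product of params and tokens (and hence compute) stays the same.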

For MoE models, the compute optimal tokens/param ratio might be 120 for 8x sparsity and 240 for 30x sparsity ^[1]. Applied to a 5e29 FLOPs model targeting a scale-up system with enough space for any number of total params (so 30x sparsity), the compute optimal number of active params would be 19T (out of 560T total), trained for 4,500T tokens. Finding 180T unique tokens (which is 25x less) seems borderline reasonable, if half of them are non-text tokens.
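
These figures can be reproduced with a short sketch, assuming the standard approximation C ≈ 6·N·D (N active params, D training tokens), which is not stated above but gives matching numbers. The function name `compute_optimal` is hypothetical.

```python
import math

def compute_optimal(flops, tokens_per_param, sparsity):
    """Return (active_params, total_params, tokens) for a compute budget.

    Assumption: flops ≈ 6 * active_params * tokens, and
    tokens = tokens_per_param * active_params.
    """
    active = math.sqrt(flops / (6 * tokens_per_param))
    tokens = tokens_per_param * active
    total = sparsity * active
    return active, total, tokens

active, total, tokens = compute_optimal(5e29, tokens_per_param=240, sparsity=30)
unique = tokens / 25  # 25x less unique data than compute optimal

for name, value in [("active", active), ("total", total),
                    ("tokens", tokens), ("unique", unique)]:
    print(f"{name}: {value / 1e12:,.0f}T")
```

Under these assumptions the results come out near 18.6T active params, 559T total params, 4,470T tokens and 179T unique tokens, which round to the 19T, 560T, 4,500T and 180T quoted above.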

The implied advice of the paper, if transferred without change over 10 orders of magnitude, is to train a 93T active param model (2,800T total params) for 5 epochs of repetition of the 180T unique tokens. The 8x Kyber Rubin Ultra scale-up system (possible buildout in 2028) has 1,180 TB of HBM, so this is borderline practical for inference and RLVR (with 3-4 scale-up systems in pipeline parallelism per inference deployment and NVFP4 FFNs). Though that’s too early for 5e29 FLOPs of pretraining, and 8x Kyber Feynman is also more likely to actually be a major part of the buildout (in 2029 or 2030), probably with more HBM.
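
As a sanity check, the data-constrained configuration should still use about 5e29 FLOPs. The sketch below assumes C ≈ 6·N·D with D counting repeated tokens (an assumption of this sketch) and uses the rounded figures from the text; `training_flops` is a hypothetical helper.

```python
import math

def training_flops(active_params, unique_tokens, epochs):
    """Assumption: C ≈ 6 * N_active * D, where D counts repeated tokens."""
    return 6 * active_params * unique_tokens * epochs

ACTIVE = 93e12    # active params, 5x the compute optimal 19T (rounded)
UNIQUE = 180e12   # unique tokens
EPOCHS = 5
SPARSITY = 30     # total params / active params

flops = training_flops(ACTIVE, UNIQUE, EPOCHS)
total_params = SPARSITY * ACTIVE
orders = math.log10(5e29 / 1e19)  # extrapolation distance from the paper's scale

print(f"training compute: {flops:.2e} FLOPs")
print(f"total params: {total_params / 1e12:,.0f}T")
print(f"orders of magnitude above 1e19 FLOPs: {orders:.1f}")
```

The 900T tokens processed (5 epochs of 180T) with 93T active params land within about 1% of 5e29 FLOPs, and 30x sparsity gives roughly the 2,800T total params quoted.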

The output token cost scales with the square root of the number of active params (via KV cache per token and model dimension), so it could be about 10x higher than for an 800B active param model (which is maybe $25 per 1M tokens, with 1M token sequences). The input token cost scales with the number of active params, so it could be about 100x higher than for an 800B active param model, which is maybe $0.5 per 1M tokens with NVFP4 FFNs. With zero gross margin for output tokens with 1M token sequences (more with shorter sequences) and 50% gross margin for input tokens, the 93T active param model pretrained for 5e29 FLOPs might be priced at $100/$250 per 1M input/output tokens at Blackwell prices for compute (as a pricing anchor only, since it won’t be able to actually run on Blackwell). Maybe this gets 4x/2x cheaper at Feynman prices, $25/$125 per 1M input/output tokens. That is exactly Mythos’s API price, so even the mind-bogglingly giant models of 2031-2033 might remain relatively “cheap”, and the token efficiency will be higher, giving a lower cost per task.
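
The pricing arithmetic can be sketched as follows. The helper names are hypothetical, and gross margin is taken as (price − cost) / price, which is an assumption of this sketch. The unrounded ratio 93T/800B is about 116x (square root about 10.8x); the text rounds these to 100x and 10x, and the sketch uses the rounded multipliers to reproduce the quoted prices.

```python
import math

def price_from_cost(cost, gross_margin):
    """Price such that (price - cost) / price equals gross_margin."""
    return cost / (1.0 - gross_margin)

REF_INPUT_COST = 0.5    # $ per 1M input tokens, 800B active params (a guess in the text)
REF_OUTPUT_COST = 25.0  # $ per 1M output tokens, 800B active params (a guess in the text)

ratio = 93e12 / 800e9                      # unrounded active param ratio
print(f"input cost multiplier: {ratio:.0f}x (rounded to 100x)")
print(f"output cost multiplier: {math.sqrt(ratio):.1f}x (rounded to 10x)")

# Blackwell-era anchor prices with the rounded multipliers
input_price = price_from_cost(REF_INPUT_COST * 100, gross_margin=0.5)
output_price = price_from_cost(REF_OUTPUT_COST * 10, gross_margin=0.0)
print(f"Blackwell anchor: ${input_price:.0f}/${output_price:.0f} per 1M tokens")

# Guessed 4x (input) and 2x (output) reduction at Feynman prices
print(f"Feynman guess: ${input_price / 4:.0f}/${output_price / 2:.0f} per 1M tokens")
```

With the unrounded multipliers the same formulas would give somewhat higher anchor prices, so the quoted figures are best read as order-of-magnitude estimates.
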
1. Based on this Jan 2025 paper, the compute optimal ratio of tokens per active param is 3x higher for an MoE model with 8x sparsity compared to a dense model, and 6x higher for an MoE model with 30x sparsity, see Figure 11 and Figure 12, left. Based on the Jul 2024 Llama 3 405B report, the compute optimal ratio for a dense model is about 40 tokens/param at 4e25 FLOPs, see Figure 2 and Figure 3. Putting these anchors together, we get 120 and 240 tokens/param respectively. ↩︎