OpenAI’s gpt-oss-120b might be the first open weights model (implicitly) revealed to be pretrained for 100T-200T tokens. The “Pretraining” section of the model card says that “The training run for gpt-oss-120b required 2.1 million H100-hours”, which probably refers to the GPU-time for pretraining alone rather than both pretraining and RLVR.
The pretraining precision is unclear, but for a model of this size FP8 is likely. Since H100-hours are cited, pretraining couldn’t (usefully) have been done in the MXFP4 the released model ships in, because H100s can’t do FP4 any faster than FP8 (Blackwell can). Also, despite claims that the model was “trained with native MXFP4 precision”, the model card also says “We post-trained the models with quantization of the MoE weights to MXFP4 format”, suggesting higher precision before post-training.
At 40% utilization, with 2e15 FP8 FLOP/s per H100, 2.1e6 H100-hours give 6e24 FLOPs (3.5x less than the original GPT-4, 2x more than DeepSeek-V3). The model only has 5.1B active params, so this suggests about 188T tokens by the 6ND rule. If it was instead pretrained in BF16 for some reason, that’s still 94T tokens.
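A quick back-of-envelope reproducing these figures (the exact token count depends on the assumed per-chip peak: ~1.9e15 FLOP/s gives the 188T quoted, while the rounder 2e15 used here gives ~198T):

```python
# Pretraining compute and implied token count, assuming ~2e15 dense FP8 FLOP/s
# per H100 and 40% utilization (both assumptions, not from the model card).
h100_hours    = 2.1e6
peak_flops    = 2e15      # dense FP8 per H100
utilization   = 0.40
active_params = 5.1e9

compute     = h100_hours * 3600 * peak_flops * utilization
tokens_fp8  = compute / (6 * active_params)   # 6ND rule
tokens_bf16 = tokens_fp8 / 2                  # BF16 peak is half the FP8 peak

print(f"compute: {compute:.1e} FLOPs")              # ~6.0e24
print(f"tokens if FP8:  {tokens_fp8 / 1e12:.0f}T")  # ~198T
print(f"tokens if BF16: {tokens_bf16 / 1e12:.0f}T") # ~99T
```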
For comparison, a compute optimal 5e26 model pretrained on 100K H100s from 2024 would also need about 100T tokens at 850B active params (assuming an MoE with a 1:8 active:total param ratio, and 120 tokens/param as compute optimal, taking Llama-3-405B’s 40 tokens/param as the dense anchor and 3x that for 1:8 sparsity). And an overtrained model with fewer active params would need even more tokens. Though plausibly in both cases there is some repetition of data.
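For concreteness, the arithmetic behind that hypothetical run (the wall-clock figure on 100K H100s is my own added sanity check, using the same 40% utilization as above):

```python
# Hypothetical compute optimal 5e26-class run: 850B active params (1:8 sparse MoE)
# at an assumed 120 tokens/param (3x the Llama-3-405B dense anchor of 40).
active_params    = 850e9
tokens_per_param = 120

tokens  = active_params * tokens_per_param   # ~1.0e14, i.e. ~100T
compute = 6 * active_params * tokens         # 6ND

h100_throughput = 2e15 * 0.4                 # dense FP8 at 40% utilization
days = compute / (100_000 * h100_throughput) / 86_400
print(f"{tokens / 1e12:.0f}T tokens, {compute:.1e} FLOPs, ~{days:.0f} days on 100K H100s")
```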
Also, this suggests the model is 80-180x overtrained (the compute optimal tokens/param multiple might be 5x-6x the dense value at gpt-oss-120b’s sparsity, so 200-240 tokens/param). Looking at the isoFLOP curves for Llama 3, this might incur a penalty of about 5x-10x in effective compute, turning the raw 6e24 FLOPs into an effective 6e23-1e24 FLOPs (the upper end of which corresponds to a 65B param compute optimal dense model trained on merely 2.6T tokens). In contrast, DeepSeek-V3 is only 2x overtrained (under the same assumptions), so its 3e24 FLOPs are more straightforward.
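Spelling out where the 80-180x range comes from (spanning the BF16/FP8 token estimates and the assumed 200-240 tokens/param optimum):

```python
# Overtraining multiple = actual tokens/param vs assumed compute optimal tokens/param.
active_params = 5.1e9
for tokens in (94e12, 188e12):               # BF16 and FP8 token estimates
    tokens_per_param = tokens / active_params
    for optimal in (200, 240):               # assumed compute optimal range
        print(f"{tokens / 1e12:.0f}T tokens: {tokens_per_param:,.0f} tokens/param"
              f" -> {tokens_per_param / optimal:.0f}x overtrained (vs {optimal})")
```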
What is the rationale to overtrain a model this much?
The model sizes were likely chosen based on typical inference constraints. Given that, they mostly care about maximizing performance, and aren’t too concerned about the compute cost, since training such small models is very affordable for them. So it’s worth going a long way into the regime of diminishing returns.
Possibly the model would’ve been too strong if it had more active params?
The number of total (rather than active) params influences the speed/cost of generating tokens, but reducing it too much stops helping at some point as the size of KV caches for all requests in a batch starts dominating. Reducing the number of active params (without changing attention or the number of total params) doesn’t influence generation of tokens, but it helps with the speed/cost of processing the initial prompt (or large tool outputs), which can be important for RAG or for loading large parts of a codebase in context.
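To put rough numbers on that tradeoff (all sizes below are my own illustrative assumptions, not the model’s actual attention dimensions):

```python
# Decode is memory-bound: per step, weights are read once for the whole batch,
# but each request's KV cache is read individually, so KV dominates at large batch.
total_params    = 120e9
active_params   = 5.1e9
bytes_per_param = 0.5                 # MXFP4
weight_bytes    = total_params * bytes_per_param       # ~60 GB

kv_bytes_per_token = 100e3            # assumed ~100 KB/token (depends on layers, KV heads, precision)
context_len        = 8_000            # assumed average context per request
kv_bytes_per_req   = kv_bytes_per_token * context_len  # ~0.8 GB

for batch in (32, 128, 512):
    per_token = weight_bytes / batch + kv_bytes_per_req
    print(f"batch {batch:4d}: {per_token / 1e9:.2f} GB read per generated token "
          f"(weights {weight_bytes / batch / 1e9:.2f} + KV {kv_bytes_per_req / 1e9:.2f})")

# Prefill is compute-bound instead: ~2 * active_params FLOPs per prompt token,
# so fewer active params directly means cheaper prompt processing.
print(f"prefill: ~{2 * active_params / 1e9:.1f} GFLOPs per prompt token")
```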
So they might’ve targeted the number of total params (120B) and a level of benchmark performance, and found that 5.1B active params is where that performance is reached. Not sure if 5.1B active params could really have been a target in itself, but it’s a nice ~6x fewer active params than comparable open weights models, if it really doesn’t destroy quality in less easily measurable ways.
What do you think about GPT-5? Is this a GPT-4.5 scale model, but with a lot of RLVR training?
The input token batch price ($0.625 per 1M tokens) works for an 850B active param model running in FP4 on GB200 NVL72 priced at $8 per chip-hour with 60% compute utilization (for prefill). If the cost of chip-hours needs to recover a third of the capital cost of the compute equipment in the first year, and 100K chips of GB200 NVL72 cost $7bn ($5M per rack all-in, with networking), then a chip-hour should cost at least $2.66.
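Checking that arithmetic, assuming ~10e15 dense FP4 FLOP/s per GB200 chip and the $5M-per-rack figure above:

```python
# Prefill economics for a hypothetical 850B-active model in FP4 on GB200 NVL72.
active_params = 850e9
fp4_peak      = 10e15                 # assumed dense FP4 FLOP/s per chip
utilization   = 0.60
chip_hour     = 8.0                   # $/chip-hour

tokens_per_hour = fp4_peak * utilization / (2 * active_params) * 3600
price_per_m     = chip_hour / (tokens_per_hour / 1e6)
print(f"{tokens_per_hour / 1e6:.1f}M prefill tokens per chip-hour, "
      f"${price_per_m:.2f} per 1M input tokens")        # ~$0.63

# Cost floor: recover a third of the hardware cost in the first year of chip-hours.
racks = 100_000 / 72                  # 72 chips per GB200 NVL72 rack
capex = racks * 5e6                   # $5M per rack all-in
floor = capex / 3 / (100_000 * 365 * 24)
print(f"${capex / 1e9:.1f}B for 100K chips -> cost floor ~${floor:.2f}/chip-hour")
```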
So there is some possibility for gross margin here in principle, even though $8 per chip-hour already sounds very cheap. GCP is selling B200-hours for $11 (a4-highgpu-8g instances), though B200s are also on gpulist for $3-4. Oracle is selling actual GB200 in 4-chip instances for $16 per chip-hour, if I’m reading it right (it’s in principle possible it’s actually $4 and $16 is for the 4-chip instance as a whole, but GCP’s prices for B200 corroborate that $16 could be right for a single chip).
There’s the Oct 2024 knowledge cutoff, which is later than when Orion should’ve started training, but in principle this could come from mid-training that got re-applied recently, or they could’ve just redone the whole run with the learnings from GPT-4.5 and an updated pretraining dataset. Also, they would’ve needed access to GB200 NVL72 to do a lot of RLVR in reasonable time if it has 6+ trillion total params, but these racks plausibly only started operating in significant numbers around May-Jun 2025, and with all the previews GPT-5 was probably done by mid-Jul 2025 at the latest.
So dunno. From my tests GPT-5 seems notably better than Opus 4 at keeping many constraints in mind without getting confused, but with gpt-oss-120b being this small and yet this capable (even though it’s clearly worse than the frontier models), it’s imaginable that gpt-5-thinking could be something like a 1T-A250B MXFP4 model (with a ~500 GB HBM footprint), and so could run on the 8-chip servers at lower cost (and get RLVR training there)...
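A rough check of the ~500 GB figure (MXFP4 stores 4-bit values plus a shared 8-bit scale per 32-element block, so ~4.25 bits per parameter; treating all 1T params as MXFP4 is a simplification):

```python
# HBM footprint of the speculative 1T-total-param MXFP4 model.
total_params   = 1e12
bits_per_param = 4.25                  # 4-bit values + 8-bit scale per 32-value block
weights_gb     = total_params * bits_per_param / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")   # ~531 GB

# An 8x B200 server has ~1.4 TB of HBM in total, leaving room for KV caches.
```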
8ND may be more accurate than 6ND, since these pretraining runs usually use gradient checkpointing (activation recomputation) to reduce memory requirements, which adds roughly one extra forward pass (~2ND) on top of the usual 6ND.
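For what that changes, under 8ND the FP8 token estimate above drops accordingly:

```python
# Same 6e24 FLOPs under 8ND instead of 6ND.
compute       = 6e24
active_params = 5.1e9
print(f"{compute / (8 * active_params) / 1e12:.0f}T tokens")   # ~147T
```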