What do you think about GPT-5? Is this a GPT-4.5-scale model, but with a lot of RLVR training?
The input token batch price is $0.625 per million tokens, which works for an 850B active param model running in FP4 on GB200 NVL72 priced at $8 per chip-hour with 60% compute utilization (for prefill). If selling chip-hours needs to recover a third of the capital cost of the compute equipment in the first year, and 100K chips of GB200 NVL72 cost $7bn ($5M per rack all-in, with networking), then a chip-hour should cost at least $2.66.
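To spell out the arithmetic, here's a minimal sketch, assuming ~10 dense FP4 PFLOP/s per GB200 chip (my assumption, not a quoted spec) and the usual 2 FLOPs per active param per token for prefill:

```python
# Back-of-envelope check of the numbers above. Assumptions (mine):
# ~10 dense FP4 PFLOP/s per GB200 chip, prefill compute of
# 2 * active_params FLOPs per token, 8766 hours per (average) year.

flops_per_chip = 10e15        # dense FP4 FLOP/s per GB200 chip (assumed)
utilization = 0.60            # compute utilization during prefill
active_params = 850e9         # hypothesized active params
chip_hour_price = 8.00        # $ per chip-hour

tokens_per_sec = flops_per_chip * utilization / (2 * active_params)
tokens_per_hour = tokens_per_sec * 3600
price_per_mtok = chip_hour_price / (tokens_per_hour / 1e6)
print(f"prefill cost: ${price_per_mtok:.3f} per 1M input tokens")  # ~$0.63

# Capital-cost floor on the chip-hour price: recover a third of the
# capex over one year of chip-hours across all 100K chips.
capex = 7e9                   # 100K chips / 72 per rack * $5M ~ $6.9bn, call it $7bn
floor = (capex / 3) / (100_000 * 8766)
print(f"chip-hour floor: ${floor:.2f}")  # ~$2.66
```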
So there is some room for gross margin here in principle, even though $8 per chip-hour already sounds very cheap. GCP is selling B200-hours for $11 (a4-highgpu-8g instances), though B200s are also on gpulist for $3-4. Oracle is selling actual GB200s in 4-chip instances for $16 per chip-hour, if I'm reading it right (it's in principle possible that it's actually $4 per chip and $16 for the 4-chip instance as a whole, but GCP's B200 prices corroborate that $16 could be right for a single chip).
There’s the Oct 2024 knowledge cutoff, which is later than when Orion should’ve started training, but in principle this could come from mid-training that got re-applied recently, or they could’ve just redone the whole run with the learnings from GPT-4.5 and an updated pretraining dataset. Also, they would’ve needed access to GB200 NVL72 to do a lot of RLVR in reasonable time if it has 6+ trillion total params, but these racks plausibly only started working in significant numbers around May-Jun 2025, and with all the previews GPT-5 was probably done by mid-Jul 2025 at the latest.
So dunno. From my tests it seems notably better than Opus 4 at keeping many constraints in mind without getting confused. But with gpt-oss-120b being this small and yet this capable (even though it’s clearly worse than the frontier models), it’s imaginable that gpt-5-thinking could be something like a 1T-A250B MXFP4 model (with a ~500 GB HBM footprint), and so could run on 8-chip servers at lower cost (and get its RLVR training there)...
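For the footprint claim, a quick sanity check, assuming MXFP4 stores 4-bit elements with one shared 8-bit scale per 32-element block (4.25 bits per weight), and ~180 GB of HBM per B200 (the DGX B200 figure; treat it as an assumption):

```python
# Rough check of the ~500 GB figure for a hypothetical 1T-A250B model.
# MXFP4: 4-bit elements plus one shared 8-bit scale per 32-element
# block, i.e. 4.25 bits per weight.

total_params = 1e12
bits_per_weight = 4 + 8 / 32               # 4.25 bits in MXFP4
weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"weights: ~{weights_gb:.0f} GB")    # ~531 GB

# An 8-chip B200 server at ~180 GB HBM per chip (assumed) has ~1440 GB
# total, so the weights fit on one node with plenty left for KV cache.
hbm_gb = 8 * 180
print(f"headroom: ~{hbm_gb - weights_gb:.0f} GB")  # ~909 GB
```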