We don’t know much about how long it takes for a new base model to get “up to speed” in post-training these days. The existence of Mythos, which was already much more capable than Opus 4.6 by late February, suggests to me that this time window has compressed compared to a year ago.
Note: DeepSeek-V4's final checkpoint wasn't even RL'd at all. They did on-policy distillation from RL-trained specialist checkpoints to produce the final model. GPT-5.5 could have similarly been (partially) distilled from advanced post-trained checkpoints of GPT-5.4.
My argument doesn’t involve saying they didn’t have enough time to RLVR it a lot yet. It’s about evidence for what “iterative deployment” means given all the GPT-5.x releases, and what it then suggests about GPT-5.5 and subsequent releases.
In principle, GPT-5.5 could be a RLVRed GPT-4.5, and in principle OpenAI has had 100K GB200 NVL72s since maybe summer 2025. But bhalstead's plot suggests a knowledge cutoff at the very end of 2024 (in the original pretrain, with no significantly later mid-training), which is likely too late for GPT-4.5, and the GB200 NVL72s probably weren't available in sufficient numbers for a while, at least for efficiently inferencing large models.
Another possibility is a new pretrain done sometime in 2025 on H100/H200/B200, which wouldn't need to wait for GB200 NVL72s; they would then have had maybe at least 6 months with enough GB200 NVL72s to experiment with RLVRing it, even if not yet enough to deploy it as a frontier model. The datapoint of apparently still doing RL on GPT-5.5 in Mar 2025 isn't evidence that this work started only recently, since the last touches of RL would happen right before a release in any case.
Mythos doesn't obviously show that it takes little time to RL a large model: it could've been pretrained at any point in 2025 and RLed together with Opus 4.5 or shortly after, once Trainium 2 Ultra racks (or maybe some TPUs) were available for that.
DeepSeek-V4's final checkpoint wasn't even RL'd at all. They did on-policy distillation from RL-trained specialist checkpoints to produce the final model.
Sure, but RLVR still needs to happen for something, even if not for the final model. If it only happens for smaller models, where it's more stable and doesn't need good/scarce/unfamiliar hardware, the results after OPD might be notably worse than if it happens for models of the same size. The DeepSeek-V4 paper doesn't disclose the nature of the teacher models, or how the quality of the result depends on them.
The possibility of OPD from GPT-5.4 just makes very fast post-training of GPT-5.5 more plausible than if it needed to be RLVRed directly, but the result would probably be inferior to what RLVRing models based on GPT-5.5's pretrain could achieve, either directly or via OPD from multiple RLVRed teachers that are also based on GPT-5.5's pretrain.
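For readers unfamiliar with the term, here is a minimal toy sketch of what on-policy distillation (OPD) means mechanically: the *student* samples tokens from its own distribution, and at each sampled prefix the loss pulls the student's next-token distribution toward the teacher's (typically a reverse KL). Everything below is illustrative (stand-in logits, a 4-token vocab, hypothetical function names), not DeepSeek's or OpenAI's actual implementation.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl(student_probs, teacher_probs):
    # KL(student || teacher): mode-seeking, the usual OPD choice,
    # as opposed to the forward KL of plain off-policy distillation.
    return sum(s * math.log(s / t)
               for s, t in zip(student_probs, teacher_probs) if s > 0)

def opd_step(student_logits_fn, teacher_logits_fn, prompt, horizon=5, rng=random):
    """One on-policy rollout: the student samples each next token itself,
    and we accumulate per-token reverse-KL loss against the teacher."""
    seq, loss = list(prompt), 0.0
    for _ in range(horizon):
        s = softmax(student_logits_fn(seq))
        t = softmax(teacher_logits_fn(seq))
        loss += reverse_kl(s, t)
        # Sampling from the *student* is what makes this on-policy:
        # the teacher is only queried on prefixes the student produced.
        tok = rng.choices(range(len(s)), weights=s)[0]
        seq.append(tok)
    return seq, loss / horizon

# Tiny stand-in "models": fixed logits over a 4-token vocab.
student = lambda seq: [0.0, 0.5, 1.0, -1.0]
teacher = lambda seq: [0.0, 1.0, 2.0, -2.0]

seq, loss = opd_step(student, teacher, prompt=[0])
```

The point relevant to the thread: the teacher here is only ever queried for per-token distributions on student-generated text, so several RLVRed specialist teachers can be distilled into one student without the student itself undergoing RL.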