they write: “We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks.”
Ah, I failed to take note of that when reading the paper. My takeaway was the opposite. In Figure 2 for R1-Zero, the first impression is indeed convergence, both from near-saturation of the benchmark and from the graph apparently leveling off. But if replotted in log-steps instead of linear steps, there isn’t even any leveling off for pass@1, despite near-saturation of the benchmark for cons@16: pass@1 accuracy is 0.45 after 2K steps, 0.55 (+0.10) after 4K steps, then 0.67 (+0.12) after 8K steps; it just keeps going up by about +0.10 with every doubling of training steps. And the plots-that-don’t-level-off in the o1 post are in log-steps. Also, the average number of reasoning steps for R1-Zero in Figure 3 is a straight line, which is probably good for something if it keeps going up. So I might even disagree with the authors in characterizing step 10K as “at convergence”, though your quote is about R1 rather than R1-Zero, for which there are plots in the paper...
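To make the replotting concrete, here is a minimal sketch of accuracy against log-steps, with a naive linear-in-log extrapolation. The numbers are the approximate pass@1 values read off Figure 2 as quoted above, not data from the paper itself, and the extrapolation ignores the fact that accuracy must saturate at 1.0:

```python
# Sketch only: replot the (approximate) R1-Zero pass@1 values from Figure 2
# against log2(training steps) to see the "+0.10 per doubling" trend.
# The values are eyeballed from the figure, not taken from any released data.
import numpy as np
import matplotlib.pyplot as plt

steps = np.array([2_000, 4_000, 8_000])   # training steps (approximate)
pass1 = np.array([0.45, 0.55, 0.67])      # pass@1 accuracy (approximate)

log_steps = np.log2(steps)
slope, intercept = np.polyfit(log_steps, pass1, 1)  # slope ~ 0.11 per doubling

# Naive extrapolation, ignoring benchmark saturation at 1.0.
future = np.array([16_000, 32_000])
predicted = slope * np.log2(future) + intercept

plt.plot(log_steps, pass1, "o-", label="pass@1 (read off Figure 2)")
plt.plot(np.log2(future), predicted, "x--", label="naive extrapolation")
plt.xlabel("log2(training steps)")
plt.ylabel("benchmark accuracy")
plt.legend()
plt.show()
```

In linear steps the same points look like they are flattening out, which is why the first impression from the figure is convergence even though the per-doubling gains are roughly constant.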
your analysis of GPT-5--which is worrying for short-term scaling
Well, I mostly argued about naming, not facts, though the recent news seems to suggest that the facts are a bit better than I expected only a month ago, namely that 1 GW training systems might only get built in 2026 rather than in 2025, except possibly at Google. And as a result, even Google might feel less pressure to actually get this done in 2025.