I saw the Nvidia paper; I don’t think the data it presents makes that case. In particular, their “intermediate” checkpoint is too far from the base model to correctly locate the crossover point (where the base model pass@k intersects the early-RLVR pass@k). And the base model choice is strange for a study like this: it already has finetuning on DeepSeek-R1 traces in it, so the base model proper is mixed up with elicitation through R1 traces when comparing against elicitation through subsequent RLVR.
In some of the plots, the intersection point isn’t visible, and mostly the “final” checkpoint seems to get worse than the “intermediate” checkpoint on pass@k plots at very high k, confirming rather than opposing the point of the Yue et al. paper (regarding the crossover point).
The fact that they’ve plotted pass@16 in Figure 1 as illustrative of the overall framing of the paper suggests that they aren’t grappling with the correct point, because if k=16 is earlier than the crossover point, then of course pass@16 performance will keep increasing. The question is whether it’ll ever exceed the performance at the crossover point.
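For concreteness, here is a sketch of why a crossover can appear at large k even when RLVR wins at small k. It uses the standard unbiased pass@k estimator (1 − C(n−c, k)/C(n, k) over n samples with c correct); the per-problem solve counts below are made up purely for illustration, standing in for a base model with broad but unreliable coverage versus an RL model that is sharp on fewer problems:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n total samples of which
    c are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer wrong samples than draws: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical solve counts out of n=256 samples, for two problems:
# the "base" model solves both problems rarely; the "RL" model solves
# one problem reliably and the other never.
n = 256
base_counts = [2, 2]    # broad coverage, low reliability
rl_counts = [100, 0]    # high reliability on a narrower set

def mean_pass_at_k(counts, k):
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)

for k in [1, 16, 64, 256]:
    print(f"k={k:>3}  base={mean_pass_at_k(base_counts, k):.3f}  "
          f"rl={mean_pass_at_k(rl_counts, k):.3f}")
```

Under these made-up numbers, the RL model dominates at k=1 and k=16, but the base model overtakes it at large k, since it eventually solves both problems while the RL model is capped at the fraction of problems it can solve at all. This is exactly why a pass@16 plot alone can't settle the question.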
(Of course, for sufficiently simple problems, RL works and can train a model to do things that the base model can’t do at all. And in principle RL should be able to do this in general, that’s the promise of RL. The question is whether it works for interesting problems that can’t be as easily solved with RL directly, using current methods for doing RLVR. If not, it can’t just be directly scaled to the moon within 1-2 years.)
Thank you for the quick reply.