Incidentally, the pass@k plots before and after RLVR (with an early intersection point) were already shown more than a year ago in the GRPO paper from Feb 2024 (Figure 7), with discussion to the same effect (Section 5.2.2):
These findings indicate that RL enhances the model’s overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
I think the main challenge to this result is that pass@10K performance for reasoning models keeps getting published in training papers, which would end up quite embarrassing if it turns out that pass@10K for the base models was significantly better all along. For example: OpenAI’s o1 vs. o3 paper (Figure 7, h/t ryan_greenblatt); DeepSeek-Prover-V2 and Kimina-Prover (h/t Weaverzhu).
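For reference, pass@k throughout this discussion is the standard metric from the Codex paper (Chen et al., 2021): the probability that at least one of k samples solves the problem, computed with the unbiased estimator 1 − C(n−c, k)/C(n, k) over n attempts with c correct. A minimal sketch (the example numbers are illustrative, not from any of the cited papers):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn without replacement from n total
    attempts of which c are correct, solves the problem."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: even a model that is right only 2 times out of 200
# attempts gets substantial coverage at large k, which is why base
# models can overtake RLVR-trained models as k grows.
print(round(pass_at_k(200, 2, 100), 3))  # → 0.751
```

This makes concrete why "boosting the correct response from TopK" helps pass@1 but not pass@10K: sharpening the distribution raises the chance of a correct sample at small k, while large-k coverage depends only on whether correct answers exist anywhere in the support.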