This seems relatively clearly false in the case of competition programming problems. Concretely, o3 with 50 submissions beats o1 with 10k submissions. (And o1 is presumably much better than the underlying instruct model.)
I’d guess this paper doesn’t have the actual optimal methods.
o3 has a different base model (presumably).
All of the figures in the paper compare an RL'd model against its own base model, so the base model is held constant between the RL and no-RL conditions.
I would expect “this paper doesn’t have the actual optimal methods” to be true; this is specifically a test of PPO on in-distribution actions. Concretely, there is a potential story here in which PPO reinforces traces that succeed during self-play, so in some sense we would expect it to only select actions that were previously on-policy.
But with enough money, one can fine-tune GPT models and test that.
Also note that 10k submissions is about 2 OOMs beyond the range of k covered by the charts in the paper.
Pass@k at infinite k includes every path with nonzero probability (assuming exact repeat paths are discarded).
We know that RL decreases model entropy, so the first k samples will be more diverse for a higher-variance model.
Pass@k takes the best of k samples, and for a normal score distribution the expected best of k samples is roughly mean + stddev*sqrt(2*ln k).
At very large k, we would therefore expect variance to matter more than mean.
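A minimal Monte Carlo sketch of that last point, with invented numbers (nothing here is from the paper): under best-of-k selection, a lower-mean, higher-variance "base-like" score distribution eventually overtakes a higher-mean, lower-variance "RL'd-like" one, roughly tracking the mean + stddev*sqrt(2*ln k) asymptotic for the maximum of k normal samples.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 1000  # Monte Carlo repetitions per (distribution, k) pair

def expected_best_of_k(mu, sigma, k):
    """Estimate E[max of k samples] for a Normal(mu, sigma) score distribution."""
    samples = rng.normal(mu, sigma, size=(trials, k))
    return samples.max(axis=1).mean()

for k in [1, 10, 100, 1000, 10000]:
    base = expected_best_of_k(0.0, 1.0, k)   # assumed base-like: lower mean, higher variance
    rl = expected_best_of_k(1.0, 0.3, k)     # assumed RL'd-like: higher mean, lower variance
    asym = np.sqrt(2 * np.log(k))            # leading-order asymptotic (loose at small k)
    print(f"k={k:>6}  base~{base:.2f} (asymptotic ~{asym:.2f})  RL'd~{rl:.2f}")
```

With these made-up parameters the base-like distribution overtakes the RL'd-like one somewhere between k=10 and k=100, even though its mean is a full standard deviation lower.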
this isn’t evidence against OP? if it’s true that RL lowers pass@k performance for sufficiently large k, we’d certainly expect o1 with 10k submissions to be weaker than base/instruct with 10k submissions.
It’s evidence to the extent that the mere fact of publishing Figure 7 (hopefully) suggests that the authors (likely aware of the relevant OpenAI internal research) didn’t expect their pass@10K result for the reasoning model to be much worse than the language-monkey pass@10K result for the underlying non-reasoning model. So maybe it’s not actually worse.
If I’m interpreting the paper correctly, the k at which base models start beating RL’d models is a per-task number, and k can be arbitrarily high for a given task; the 50-400 range was specifically for tasks of the type the authors chose within a narrow difficulty band.
Let’s say you have a base model which performs at 35% on 5-digit addition, and an RL’d model which performs at 99.98%. Even if the failures of the RL’d model are perfectly correlated, you’d need k=20 for base@20 to exceed the performance of fine-tuned@20. And the failures of the RL model won’t be perfectly correlated; but this paper claims that the failures of the RL model will be more correlated than the failures of the base model, so the lines will cross eventually, and “eventually” was @50 to @400 in the tasks they tested.
But you could define a task where you pass in 10 pairs of 5-digit numbers and the model must correctly find the sum of each pair. The base model will probably succeed at this task somewhere on the order of 0.35^10, or about 0.003% of the time, while the RL’d model should succeed about 99.8% of the time. So for this task we’d expect a crossover in the range of k=220,000 assuming perfectly-correlated failures in the RL model, and higher otherwise.
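A quick back-of-the-envelope check of those numbers (a sketch using the assumed success rates from the comment above; the helper names are made up, and nothing here comes from the paper):

```python
import math

p_base_single = 0.35     # assumed base-model accuracy on one 5-digit addition
p_rl_single = 0.9998     # assumed RL'd-model accuracy on one 5-digit addition

def pass_at_k_independent(p, k):
    """pass@k for k independent attempts with per-attempt success probability p."""
    return 1 - (1 - p) ** k

def crossover_k(p_base, rl_ceiling):
    """Smallest k at which independent base sampling beats an RL'd model whose
    failures are perfectly correlated, i.e. whose pass@k is stuck at rl_ceiling."""
    return math.ceil(math.log(1 - rl_ceiling) / math.log(1 - p_base))

# Single-pair task: crossover at k=20.
print(crossover_k(p_base_single, p_rl_single))     # 20
print(pass_at_k_independent(p_base_single, 20))    # ~0.99982 > 0.9998

# Ten-pair task: one sample must get all ten sums right.
p_base_ten = p_base_single ** 10    # ~2.8e-5, i.e. ~0.003%
p_rl_ten = p_rl_single ** 10        # ~0.998
print(crossover_k(p_base_ten, p_rl_ten))           # ~225,000
```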
Also I suspect that there is some astronomically high k such that monkeys at a keyboard (i.e. “output random tokens”) will outperform base models for some tasks by the pass@k metric.
It would be an extreme bias-variance tradeoff, yes.
The interesting concept in the paper is the location of the crossover point, which seems remarkably stable (for a given task) across specific RL techniques and amounts of RL training. It can be measured experimentally for a task by doing a little bit of RL training, and RL@1 performance won’t get better than that crossover level with more training. So you’re unlikely ever to get the RL model to succeed 99.8% of the time (at pass@1) unless the base model’s performance at the crossover point, measured against even a weak RL model, was already higher than 99.8%.
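For concreteness, here is a sketch of how such a crossover point could be located empirically, using the standard unbiased pass@k estimator from n samples per problem (Chen et al., 2021). The per-problem correct counts below are invented for illustration, and this is not the paper's actual code:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given per-problem correct counts out of n samples."""
    return float(np.mean([pass_at_k(n, c, k) for c in correct_counts]))

def find_crossover(base_counts, rl_counts, n, ks=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Smallest sampled k at which the base model's mean pass@k catches up to the RL'd model's."""
    for k in ks:
        if mean_pass_at_k(base_counts, n, k) >= mean_pass_at_k(rl_counts, n, k):
            return k
    return None  # no crossover within the sampled budget

# Invented per-problem correct counts out of n=256 samples, for illustration only:
n = 256
base_counts = [5, 40, 2, 90, 12]        # base model: low pass@1, some success on every problem
rl_counts = [200, 250, 0, 256, 180]     # RL'd model: high pass@1, one problem it never solves
print(find_crossover(base_counts, rl_counts, n))  # 64 for these made-up counts
```

The point of the toy counts is that a single problem the RL'd model never solves caps its pass@k, while the base model's broad low-rate coverage keeps improving with k.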
Probably the crossover point for a task depends on things that can be changed, such as the strength of the pretrained model, the size/relevance of the verifiable task dataset, or possibly the inference-time reasoning budget. The issue isn’t, for example, as straightforward as entropy loss in the RL policy (as a formulation of reduced exploration): DAPO specifically addresses this issue (otherwise present in vanilla GRPO), yet the pass@k plot for DAPO (Figure 7, top) barely moves compared to other methods; in their experiment it’s even slightly worse at the crossover point.
So in the context of this paper, it remains unclear how to move the plot so that RL@1 reaches ever-higher base@k performance, higher than the ceiling set by where base@k already was at the crossover point when comparing against some method after only 100-500 RL steps.
Intuitively, this shouldn’t matter much. They use some RL-on-CoTs method that works, and I expect its effects are not fundamentally different from those of optimal methods. Thus, optimal methods might yield better quantitative results, but similar qualitative results: maybe they’d let you elicit pass@800 capabilities instead of “just” pass@400, but it’d still be pass@k elicitation for a not-astronomical k.
Not strongly convinced of that, though.
In the hypothetical where the paper’s results hold, a reasoning model’s pass@k performance will match the non-reasoning model’s performance at a number of samples closer to the crossover point between the reasoning and non-reasoning pass@k plots. If those points for o1 and o3 are somewhere between 50 and 10K (say, at ~200), then pass@10K for o1 might be equivalent to ~pass@400 for o1’s base model (looking at Figure 2), while pass@50 for o3 might be equivalent to ~pass@100 for its base model (which is probably different from o1’s base model).
So the difference of 200x (10K vs. 50) in the number of samples becomes much smaller when comparing performance of the base models. For GPT-4o vs. GPT-4.1, a difference of ~4x in the number of samples doesn’t seem too strange. There’s also the possibility of distillation from a reasoning variant of GPT-4.5, which could have an even larger effect on pass@k performance at low k (Figure 6, right).
If true, would this imply you want a base model to generate lots of solutions and a reasoning model to identify the promising ones and train on those?