Out of domain (i.e., on a different math benchmark), the RLed model does better at pass@256, especially when using algorithms like RLOO and Reinforce++. If there is a crossover point, it is in the thousands of samples. (Figure 7)
This seems critically important. Production models are RLed on hundreds to thousands of benchmarks.
We should also consider that, well, this result just doesn’t pass the sniff test given what we’ve seen RL models do. o3 is a lot better than o1 in a way that suggests RL budgets do scale heavily with compute, and o3 if anything is better at scaling up in a pass@N way (o3 is reported to be fully parallelizable, capable of scaling up to $1000s of compute).
Something is up here. Maybe it’s lack of test-time scaling, maybe OpenAI really do have a secret RL algorithm (nobody else has demonstrated the capability to scale up test-time compute in quite the way that o3 can). Maybe the authors just did it wrong. Maybe the authors didn’t do enough RL (again, we know o3 used a lot of RL compute; the authors here only did 100s of steps).
Overall I don’t buy the conclusions of that paper.
o3 is a lot better than o1 in a way that suggests RL budgets do scale heavily with compute, and o3 if anything is better at scaling up in a pass@N way (o3 is reported to be fully parallelizable, capable of scaling up to $1000s of compute).
o3 may also have a better base model. o3 could be worse at pass@n for high n relative to its base model than o1 is relative to its base model, while still being better than o1.
I don’t think you need very novel RL algorithms for this either—in the paper, Reinforce++ still does better for pass@256 in all cases. For very high k, pass@k being higher for the base model may just imply that the base model has a broader distribution to sample from, while at lower k the RL’d models benefit from higher reliability. This would imply that it’s not a question of how to do RL such that the RL model is always better at any k, but how to trade off reliability for a more diverse distribution (and push the Pareto frontier ahead).
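The reliability-vs-diversity tradeoff above is easy to see in a toy model. A minimal sketch, assuming purely hypothetical numbers (these are not from the paper): the base model has a small per-sample success probability on every problem, while the RL’d model is far more reliable per sample but has collapsed onto a distribution that can only ever solve a subset of the problems.

```python
# Toy model of the pass@k crossover between a broad base model and a
# narrower-but-more-reliable RL'd model. All numbers are illustrative.

def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k independent samples is correct),
    given per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k

BASE_P = 0.02        # base model: 2% per sample, on every problem
RL_P = 0.5           # RL'd model: 50% per sample...
RL_COVERAGE = 0.7    # ...but only on 70% of problems; 0% on the rest

for k in [1, 4, 16, 64, 256, 1024]:
    base = pass_at_k(BASE_P, k)
    rl = RL_COVERAGE * pass_at_k(RL_P, k)
    print(f"k={k:5d}  base={base:.3f}  rl={rl:.3f}")
```

With these made-up parameters the RL’d model dominates at low k, but the base model overtakes it somewhere below k=256 and saturates toward 1.0, matching the "crossover in the thousands (or hundreds)" shape rather than either model being uniformly better.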
o3 is reported to be fully parallelizable, capable of scaling up to $1000s of compute
If you’re referring to the ARC-AGI results, it was just pass@1024, for a nontrivial but not startling jump (75.7% to 87.5%). About the same ballpark as in the paper, plus we don’t actually know how much better its pass@1024 was than its pass@256.
The costs aren’t due to an astronomical k, but due to it writing a 55k-token novel for each attempt plus high $X/million output tokens. (Apparently the revised estimate is $600/million??? It was $60/million initially.)
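For concreteness, the cost arithmetic from the figures quoted above (55k output tokens per attempt, 1024 attempts, and the revised $600/million estimate; all of these are reported or estimated numbers, not confirmed):

```python
# Back-of-envelope per-task cost from the numbers quoted above.
tokens_per_attempt = 55_000
attempts = 1024
usd_per_million_tokens = 600  # the revised estimate; it was 60 initially

cost = tokens_per_attempt * attempts * usd_per_million_tokens / 1_000_000
print(f"${cost:,.0f} per task")  # ≈ $33,792
```

So the token count per attempt, not the number of attempts, is what makes the bill astronomical: at the original $60/million the same run would be a tenth of that.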
(FrontierMath was pass@1. Though maybe they used consensus@k instead (outputting the most frequent answer out of k, with only one “final answer” passed to the task-specific verifier) or something.)
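Consensus@k as described would look roughly like this sketch. The sampler here is a hypothetical stand-in; nothing below is OpenAI’s actual setup.

```python
from collections import Counter
import random

def consensus_at_k(sample_answer, k: int):
    """Sample k answers and submit only the most frequent one.
    Unlike pass@k, the verifier sees a single final answer."""
    answers = [sample_answer() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical model: right answer 40% of the time, with the wrong mass
# split across many distinct wrong answers, so the mode is usually correct.
random.seed(0)
def toy_sampler():
    return "42" if random.random() < 0.4 else f"wrong-{random.randrange(1000)}"

print(consensus_at_k(toy_sampler, 64))
```

The key property is that consensus@k only beats pass@1 when the model’s errors are scattered; if one wrong answer is systematically tempting, the vote can lock onto it.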
If you’re referring to the ARC-AGI results, it was just pass@1024
It was sampling 1024 reasoning traces, but I don’t expect it was being scored as pass@1024. An appropriate thing to happen there would be best-of-k or majority voting, not running reasoning 1024 times and seeing if at least one result happens to be correct.
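The two scoring rules can disagree sharply on the exact same 1024 traces. A deterministic toy example (the answer counts are invented for illustration; we don’t know o3’s actual distribution):

```python
from collections import Counter

# Stand-in for 1024 sampled answers: the correct answer appears 100 times,
# one systematically tempting wrong answer dominates with 500, and the
# remaining traces are scattered one-off mistakes.
traces = (["correct"] * 100
          + ["tempting-wrong"] * 500
          + [f"other-wrong-{i}" for i in range(424)])

# pass@1024: success if ANY trace is correct.
pass_at_1024 = any(a == "correct" for a in traces)

# Majority vote: success only if the correct answer is the mode.
majority_vote = Counter(traces).most_common(1)[0][0] == "correct"

print(pass_at_1024, majority_vote)  # True False
```

So a result scored as pass@1024 would count this task as solved, while a best-of-k or majority-vote pipeline would not, which is why it matters which rule the reported numbers actually used.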
We should also consider that, well, this result just doesn’t pass the sniff test given what we’ve seen RL models do.
FWIW, I interpret the paper to be making a pretty narrow claim about RL in particular. On the other hand, a lot of the production “RL models” we have seen may not be pure RL. For instance, if you wanted to run a similar test to this paper on DeepSeek-V3+, you would compare DeepSeek-V3 to DeepSeek-R1-Zero (pure RL diff, according to the technical report), not to DeepSeek-R1 (trained with a hard-to-follow mix of SFT and RL). R1-Zero is a worse model than R1, sometimes by a large margin.
Agree, I’m pretty confused about this discrepancy. I can’t rule out that it’s just the “RL can enable emergent capabilities” point.
I am not confused, the results of this paper are expected on my model.