Jozdien comments on Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Jozdien 6 May 2025 0:20 UTC
9 points
2
o3 is a lot better than o1 in a way which suggests that RL budgets do scale heavily with xompute, and o3 if anything is better at scaling up in a Pass@N way (o3 is reported to be fully parallelizable, capable of scaling up to $1000s of compute).
o3 may also have a better base model. o3 could be worse at pass@n for high n relative to its base model than o1 is relative to its base model, while still being better than o1.
I don’t think you need very novel RL algorithms for this either—in the paper, Reinforce++ still does better for pass@256 in all cases. For very high k, pass@k being higher for the base model may just imply that the base model has a broader distribution to sample from, while at lower k the RL’d models benefit from higher reliability. This would imply that it’s not a question of how to do RL such that the RL model is always better at any k, but how to trade off reliability for a more diverse distribution (and push the Pareto frontier ahead).