Vladimir_Nesov comments on Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Vladimir_Nesov 5 May 2025 20:20 UTC
2 points

Out of domain (i.e. on a different math benchmark) the RLed model does better at pass@256, especially when using algorithms like RLOO and Reinforce++. If there is a crossover point it is in the thousands. (Figure 7)

Looking at Figure 7, I think the latest intersection point for the top right plot (MATH500, RLOO) is at about pass@512, while for the bottom right plot (MATH500, step150) it’s between pass@512 and pass@1024 (and gets worse with further RL training), so the crossover is probably not in the thousands.
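
For anyone who wants to check a crossover from raw sample counts rather than by reading it off the plot, here is a minimal sketch. It assumes the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples per problem of which c are correct; I have not verified that the paper computes its curves exactly this way. The helper names (`pass_at_k`, `mean_pass_at_k`, `crossover_k`) and the counts in the usage example are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given the per-problem count of correct samples."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


def crossover_k(base_counts, rl_counts, n: int, ks):
    """Smallest k in ks from which the base model is at or above the RL model
    for every larger k in ks. Returns None if the RL model leads at the largest k."""
    result = None
    for k in sorted(ks, reverse=True):
        if mean_pass_at_k(base_counts, n, k) >= mean_pass_at_k(rl_counts, n, k):
            result = k
        else:
            break
    return result


# Hypothetical toy data (NOT from the paper): 4 problems, 8 samples each.
# The base model solves every problem occasionally; the RL model solves
# half of the problems every time and the other half never.
base_counts = [2, 2, 2, 2]
rl_counts = [8, 8, 0, 0]
print(crossover_k(base_counts, rl_counts, n=8, ks=[1, 2, 4, 8]))
```

On this toy data the RL model sits at 0.5 for every k while the base model keeps climbing, which is the shape of the curves being discussed. Note that the result is only resolved to the grid of k values you pass in, so a crossover "between pass@512 and pass@1024" would show up as 1024 on a powers-of-two grid.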