Out of domain (i.e. on a different math benchmark) the RLed model does better at pass@256, especially when using algorithms like RLOO and Reinforce++. If there is a crossover point it is in the thousands. (Figure 7)
Looking at Figure 7, I think the latest intersection point for the top-right plot (MATH500, RLOO) is at about pass@512, while for the bottom-right plot (MATH500, step150) it’s between pass@512 and pass@1024 (and gets worse with further RL training), so the crossover is probably not in the thousands.
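For anyone who wants to locate a crossover from per-problem sample counts rather than by eyeballing a plot, here is a minimal sketch. It assumes the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), which is only defined for k ≤ n. The function names and the correct-answer counts are made up for illustration; none of the numbers come from Figure 7.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct (k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given each problem's number of correct samples."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

def crossover_k(base_counts, rl_counts, n, ks):
    """Smallest k in ks where the base model's mean pass@k is at least the RL model's."""
    for k in ks:
        if mean_pass_at_k(base_counts, n, k) >= mean_pass_at_k(rl_counts, n, k):
            return k
    return None  # no crossover within the tested range

# Synthetic toy data (an assumption, not read off any figure): the base model
# solves every problem rarely; the RL model solves three reliably and one never.
n = 64
base_counts = [2, 2, 2, 2]
rl_counts = [32, 32, 32, 0]
ks = [1, 2, 4, 8, 16, 32, 64]

print(crossover_k(base_counts, rl_counts, n, ks))
```

Because only a grid of k values is checked, this gives the first grid point at or after the crossover, not the exact crossing, which is the same resolution you get from reading a log-scale pass@k plot.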