Out of domain (i.e., on a different math benchmark), the RL’d model does better at pass@256, especially with algorithms like RLOO and Reinforce++. If there is a crossover point, it might be near pass@1024 (Figure 7).
When using Reinforce++, the RL’d models perform better at pass@256 as well:
That said, the curves look like the base model is catching up, and there may be a crossover point of pass@256 or the like for the in-domain tasks as well.
This could be attributable to the base models being less mode-collapsed and able to sample more diversely. I don’t think I would have predicted this result in advance, however, and it likely depends on the specifics of the RL setup.
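To make the crossover idea concrete, here is a minimal sketch in Python. `pass_at_k` is the standard unbiased estimator computed from n samples with c correct, which may or may not be exactly what the paper uses. `expected_pass_at_k` and the per-problem solve rates below are hypothetical, made up purely to illustrate how a more diverse sampler can overtake a sharper one at large k. None of the numbers come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def expected_pass_at_k(solve_probs, k):
    """Mean of 1 - (1 - p)^k over problems, given true per-sample solve rates p."""
    return sum(1.0 - (1.0 - p) ** k for p in solve_probs) / len(solve_probs)


# Hypothetical per-problem solve rates, invented for illustration only.
rl_probs = [0.9, 0.9, 0.0, 0.0]      # sharp: reliable on half, never solves the rest
base_probs = [0.2, 0.2, 0.01, 0.01]  # diffuse: rarely right, but never at zero

for k in (1, 16, 256, 1024):
    rl = expected_pass_at_k(rl_probs, k)
    base = expected_pass_at_k(base_probs, k)
    print(f"k={k:5d}  rl={rl:.3f}  base={base:.3f}")
```

In this toy setup the sharp model wins at pass@1 but plateaus once it has solved everything it can solve, while the diffuse model keeps climbing with k. Where the crossover lands depends entirely on the assumed solve rates, which is consistent with the result depending on the specifics of the RL setup.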