In fact, base models seem to be better than RL models at reasoning when you take best of N (with the same N for both the RL'd and base model). Check out my post summarizing the research on the matter:
Yue, Chen et al. have a different hypothesis: what if the base model already knows all the reasoning trajectories, and all RL does is increase the frequency of reasoning, or the frequency of the trajectories that are likely to work? To test this, Yue, Chen et al. use pass@K: give the LLM a total of K attempts to answer the question, and if any of them succeed, mark the question as answered correctly. They report the proportion of correctly answered questions in the data set.
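To make the metric concrete, here is a minimal sketch of how pass@K is commonly estimated in this line of work, following the unbiased estimator from Chen et al. (2021): sample n completions per question, count how many are correct, and compute the probability that a random subset of K of them contains at least one correct answer. The function names and the (n, c) result tuples below are illustrative, not the paper's actual code.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for a question
    c: how many of those completions were correct
    k: the attempt budget being evaluated
    Returns the probability that at least one of k attempts drawn
    (without replacement) from the n samples is correct.
    """
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill all k attempts
    # C(n-c, k) / C(n, k) is the probability that all k drawn samples are wrong
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def dataset_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a dataset; results holds one (n, c) pair per question."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Note that under this metric a question the base model solves only once in a thousand samples still counts toward pass@1000, which is exactly why large K rewards a model that merely has the right trajectory somewhere in its distribution.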
If the RL model genuinely learns new reasoning skills, over many questions the pass@K performance of RL will remain higher than the performance of the base model. As we increase K, the base model answers more and more of the easy questions, so its performance improves. But the RL model also answers more and more of the difficult questions, so its performance improves too. The performance of both increases in tandem with larger K.
What actually happened is neither of these two things. For large enough K, the base model does better than the RL model. (!!!)
Fwiw I’m skeptical that this holds at higher levels of RL than those used in the paper. Do you think that a base model can get gold on the IMO at any level of sampling?
Vacuously true. The actual question is: how much do you need to sample? My guess is it’s too much, but we’d see the base model scaling better than the RL’d model just like in this paper.
Fortunately, DeepSeek’s Mathv2 just dropped, an open-source model that gets IMO gold. We can do the experiment: does it similarly fail to improve with sampling compared to its own base model? My guess is yes, the same will happen.