This appears true at the academic scale, but not at the frontier scale, where RL compute consumption is much higher (sometimes even higher than pretraining).
As a counter-example to your evidence, when Nvidia scaled up their RL they found:
Furthermore, Nemotron-Research-Reasoning-Qwen-1.5B offers surprising new insights—RL can indeed discover genuinely new solution pathways entirely absent in base models, when given sufficient training time and applied to novel reasoning tasks. Through comprehensive analysis, we show that our model generates novel insights and performs exceptionally well on tasks with increasingly difficult and out-of-domain tasks, suggesting a genuine expansion of reasoning capabilities beyond its initial training. Most strikingly, we identify many tasks where the base model fails to produce any correct solutions regardless of the amount of sampling, while our RL-trained model achieves 100% pass rates (Figure 4).
That’s an interesting paper. I’m not quite up to thinking through the implications in full due to lack of sleep, but demarcating scenarios where fine-tuning clips the existing distribution of behavior versus scenarios where it leads to something new entirely could be a valuable distinction. I wonder what that looks like in terms of network structure.