Yes, it matters for current model performance, but it means that RLVR isn't actually improving the model in a way that can be used in an iterated distillation & amplification loop, because it doesn't do real amplification. If this turns out to be right, it's quite bearish for AI timelines.
Edit: Ah, someone just alerted me to the crucial consideration that this was tested on smaller models (Qwen-2.5 at 7B/14B/32B and LLaMA-3.1-8B), which are significantly smaller than the models where RLVR has shown the most dramatic improvements (like DeepSeek-V3 → R1 or GPT-4o → o1). Given that different researchers have claimed there's a threshold effect, this substantially weakens the findings. But the authors say they're currently evaluating DeepSeek-V3 & R1, so I guess we'll see.
More thoughts:
I thought AlphaZero was a counterpoint, but apparently it's significantly different. For example, it used true self-play, which allowed it to discover fully novel strategies rather than merely amplifying behaviors already present in a pretrained base model.
Then again, I don't think more sophisticated reasoning is the bottleneck to AGI (compared to executive function & tool use), so even if reasoning doesn't really improve for a few years, we could still get AGI.
However, I previously thought reasoning models could be leveraged to figure out how to perform difficult tasks, with the best action sequences then distilled into a better agent model, IDA-style. But this paper makes me more skeptical of that working, because those agentic steps might require novel skills that aren't in the training data.
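To pin down what I mean by the loop not compounding, here's a minimal toy sketch of IDA in Python. Everything in it is illustrative and not any real training stack: the "model" is just a probability distribution over answer strings, amplification is best-of-n sampling against a verifiable reward, and distillation shifts probability mass toward the winners.

```python
import random
from typing import Dict, List, Tuple

Task = str
Answer = str

def verifier(task: Task, answer: Answer) -> float:
    # Stand-in for an RLVR-style verifiable reward (unit tests, exact match).
    return 1.0 if answer == task.upper() else 0.0

def amplify(policy: Dict[Answer, float], task: Task, n: int = 32) -> Answer:
    # "Amplification": spend extra inference compute (best-of-n against the
    # verifier) to do better than a single sample from the policy.
    answers = list(policy)
    weights = [policy[a] for a in answers]
    candidates = random.choices(answers, weights=weights, k=n)
    return max(candidates, key=lambda a: verifier(task, a))

def distill(policy: Dict[Answer, float],
            demos: List[Tuple[Task, Answer]]) -> Dict[Answer, float]:
    # "Distillation": shift probability mass toward the amplified answers.
    new = dict(policy)
    for _, answer in demos:
        new[answer] = new.get(answer, 0.0) + 1.0
    total = sum(new.values())
    return {a: w / total for a, w in new.items()}

def ida(policy: Dict[Answer, float], tasks: List[Task],
        rounds: int = 5) -> Dict[Answer, float]:
    for _ in range(rounds):
        demos = [(t, amplify(policy, t)) for t in tasks]
        policy = distill(policy, demos)
    # The catch the paper points at: amplify() can only ever surface answers
    # already in the policy's support. If the right answer was never there,
    # no number of rounds adds it -- that's reweighting, not real amplification.
    return policy

# The loop sharpens the policy toward "HELLO" only because "HELLO" was in
# its support to begin with.
policy = {"hello": 0.5, "HELLO": 0.25, "olleh": 0.25}
print(ida(policy, tasks=["hello"]))
```

If the paper's claim holds, RLVR-style training is the reweighting case: each round's ceiling is the base model itself, which is exactly why the distill-into-a-better-agent step wouldn't compound.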