Another reason this could be misleading is that sampling from the model pre-mode-collapse may perform better at high k even if RL did teach the model new capabilities. In general, taking a large number of samples from a slightly worse model before mode collapse often outperforms taking a large number of samples from a better model after mode collapse, even if the RL model is genuinely more powerful (in the sense that its best reasoning pathways are better than the original model's best). In other words, the RL model can be much smarter and still perform worse on these high-k evaluations purely because of diversity loss. Conversely, if you're trying to train a model to accomplish a difficult, novel reasoning task, it may still be much better to start from the smarter RL model than from the more diverse original model.
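The effect is easy to see in a toy pass@k calculation. The numbers below are illustrative assumptions, not measurements from real models: the diverse base model solves each problem with a low per-sample probability but its samples are independent, while the mode-collapsed RL model has a higher per-sample success rate but produces (near-)identical samples, so extra draws add nothing.

```python
def pass_at_k_diverse(p: float, k: int) -> float:
    """Pass@k when the k samples are independent draws (pre-collapse model)."""
    return 1.0 - (1.0 - p) ** k

def pass_at_k_collapsed(p: float, k: int) -> float:
    """Pass@k when mode collapse makes every sample essentially identical."""
    return p  # k copies of the same answer are no better than one

# Assumed per-sample success rates (hypothetical, for illustration only)
p_diverse, p_rl = 0.05, 0.30

for k in (1, 8, 64, 256):
    print(f"k={k:>3}  diverse={pass_at_k_diverse(p_diverse, k):.3f}  "
          f"collapsed={pass_at_k_collapsed(p_rl, k):.3f}")
```

At k=1 the RL model looks far stronger (0.30 vs 0.05), but by k=64 the diverse model's pass@k exceeds 0.96 while the collapsed model is stuck at 0.30, even though the RL model is "better" sample-for-sample.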