I ran the horizon length graph with pass@8 instead, and the increase between GPT-4 and o1 seems slightly smaller (though this could be noise), and Claude 3.7 does worse than o1. This suggests the doubling rate for pass@8 may be slightly slower than for pass@1. However, if the horizon length increase since 2023 were only due to RL, the improvement in pass@8 would be barely half the improvement in pass@1. The observed pass@8 improvement is faster than that, which could be due to some combination of the following:
o1 has a better base model than GPT-4
HCAST is an example of “emergent capabilities on long-horizon tasks”, unlike their non-agentic tasks
RL helps more on HCAST skills than math or non-agentic coding
RL helps more on larger models
Some kind of nonlinearity in the relevant metrics
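For concreteness on what the two metrics measure: pass@1 is the success rate of a single attempt, while pass@8 asks whether at least one of 8 attempts succeeds. A standard way to estimate pass@k from n sampled attempts (of which c succeeded) is the unbiased estimator from the HumanEval paper; I don't know the exact computation used for the graph above, so this is just a sketch of the conventional metric:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k.

    n: total attempts sampled per task
    c: number of those attempts that succeeded
    k: budget of attempts we are scoring (e.g. 1 or 8)

    Returns the probability that at least one of k attempts drawn
    without replacement from the n samples is a success:
        1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts, 5 successes.
print(pass_at_k(10, 5, 1))  # pass@1 = 0.5
print(pass_at_k(10, 5, 8))  # pass@8 = 1.0 (can't pick 8 attempts avoiding all 5 successes)
```

Note that pass@8 saturates much earlier than pass@1, which is one reason improvements on the two metrics need not move in lockstep.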
There are various problems with this graph (e.g., to follow the same methodology we should filter for models that are frontier on pass@8, not frontier on pass@1), but this was meant to be a quick and dirty check.