It’s the shape of a possible ceiling on capability for R1-like training methods; see previous discussion here. The training is very useful, but it might just be ~pass@400 useful rather than ~without limit like AlphaZero. Since base models are not yet reliable at crucial capabilities even at ~pass@400, neither would their RL-trained variants become reliable.
It’s plausibly a good way of measuring how well an RL training method works for LLMs, another thing to hill-climb on. The question is how easy it will be to extend this ceiling once you are aware of it, and the paper tries a few things that fail utterly (multiple RL training methods and different numbers of training steps; see Figure 7), which weakly suggests it might be difficult to push it multiple orders of magnitude further quickly.
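For readers unfamiliar with the metric: pass@k is the probability that at least one of k sampled attempts solves a problem, conventionally estimated with the unbiased estimator from the Codex paper (Chen et al., 2021). A minimal sketch, assuming that estimator and with purely illustrative numbers (not taken from the paper), shows why a problem can look near-solved at ~pass@400 while being unreliable at pass@1:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k subset of samples contains at least one correct attempt
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical problem solved 3 times out of 500 samples:
print(pass_at_k(n=500, c=3, k=400))  # ~0.99, looks solved at pass@400
print(pass_at_k(n=500, c=3, k=1))    # 0.006, essentially unreliable at pass@1
```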