Thanks for this insightful analysis!
> But it fits with the extreme information inefficiency of RL training, which (compared to next-token-prediction) receives less than a ten-thousandth as much information to learn from per FLOP of training compute.
If I am interpreting this correctly, there is a subtle mathematical error here: if RL requires a constant factor of 10,000 more compute than pretraining, this only shifts the graph of performance against log(compute) sideways; it doesn’t change its slope. For RL to have a shallower slope, the information efficiency would have to decrease more quickly over the course of training for RL than for pretraining.
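To spell out why a constant factor only shifts the curve (a minimal sketch, assuming performance is roughly linear in log-compute, and using my own notation rather than anything from the post): if pretraining follows

$$P_{\text{pretrain}}(C) = a \log C + b,$$

then an RL method that needs a constant factor $k$ (say $10^4$) more compute to extract the same information gives

$$P_{\text{RL}}(C) = a \log(C/k) + b = a \log C + \big(b - a \log k\big),$$

which has the same slope $a$, just shifted down (equivalently, shifted right along the log-compute axis). A genuinely shallower slope requires the effective $k$ to grow with $C$, i.e. the information-efficiency gap to widen over the course of training.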
I think there are a few potential reasons why information efficiency might decrease more quickly over the course of training for RL than for pretraining, but it is not so clear-cut:
Increased accuracy: you get fewer bits of information from a more biased coin flip than a fairer one, so information efficiency decreases as you approach 100% accuracy. But it’s not clear whether this applies more to pretraining or to RL. Note also that in both cases the effect can potentially be alleviated by a curriculum.
Longer episodes: assuming RL just has a single binary reward at the end of each episode, information density decreases as episodes get longer. Since harder tasks require longer chains of thought, this one seems to clearly count against RL. (I sketch a rough calculation combining this with the accuracy point below.)
Overfitting: if there is a mismatch between the training distribution used for RL and the distribution used to benchmark the model, one might expect the density of information relevant to the benchmark to decrease as the model overfits to the training distribution. I think this one also counts against RL right now, but can be alleviated by improving data quality and quantity.
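Putting the first two points together quantitatively (a back-of-the-envelope sketch in my own notation, as referenced above): if each episode ends in a single binary reward with success probability $p$ and runs for $T$ tokens, then the information available per episode is at most the entropy of that reward,

$$H(p) = -p \log_2 p - (1-p)\log_2(1-p) \ \text{bits},$$

while the compute per episode grows at least linearly with $T$. So the information per unit of compute is at most on the order of $H(p)/T$, which shrinks both as $p \to 1$ (the biased-coin effect) and as $T$ grows with task difficulty, whereas next-token prediction continues to receive a loss signal on every token.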
In particular, I think the fact that overfitting can be mitigated with better data cuts against extrapolating your empirical observations. Since, as you correctly note, RL compute started from a very small base, it was initially much cheaper to scale up compute than to scale up data. But as RL compute becomes more expensive, it will become comparatively more cost-effective to scale up data. Once spending on both is being scaled up at a similar rate (as is economically inevitable as long as spending continues to increase), we should expect to see some regression towards the pretraining slope, in my opinion.
Overall, I think the effect you spotted is real (due to things like episode length), but ultimately won’t turn out to be as extreme as you estimated here. Quantitatively, I would guess that RL will look more like a power of 1.5-2 worse than pretraining rather than a power of 3 worse, and there could be certain training regimes (e.g. fixed episode length) where they are closer than that.
Nice observation, and I agree with your calculation that linear episode length growth would account for a worse scaling exponent by a factor of 2 (or more generally, episode length growing with exponent k would account for a worse scaling exponent by a factor of k+1).
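For anyone following along, here is my reconstruction of that calculation (using episodes as the unit of data): suppose that after $N$ training episodes the typical episode length is $T \propto N^{k}$, with $k = 1$ being the linear case. Total RL compute is then

$$C \;\propto\; \sum_{n=1}^{N} n^{k} \;\propto\; N^{k+1}, \qquad \text{so} \qquad N \;\propto\; C^{1/(k+1)}.$$

Since the terminal reward supplies $O(1)$ bits per episode, the information accumulated grows like $N \propto C^{1/(k+1)}$ rather than $\propto C$, so any power law in information becomes a power law in compute with an exponent smaller by a factor of $k+1$ (a factor of 2 in the linear case), matching your calculation.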
Note also that this suggests a potential remedy, namely controlling episode length, but there is less incentive to apply this when data is more of a constraint than compute.