But it fits with the extreme information inefficiency of RL training, which (compared to next-token-prediction) receives less than a ten-thousandth as much information to learn from per FLOP of training compute.
Thanks for this insightful analysis! If I am interpreting this correctly, there is a subtle mathematical error here: if RL requires a constant factor of 10,000 more compute than pretraining, this only shifts the graph of performance against log(compute); it doesn’t change its slope. For RL to have a shallower slope, the information efficiency would have to decrease more quickly over the course of training for RL than for pretraining.
I think there are a few potential reasons why information efficiency might decrease more quickly over the course of training for RL than for pretraining, but it is not so clear-cut:
Increased accuracy: you get fewer bits of information from a more biased coin flip than a fairer one, so information efficiency decreases as you approach 100% accuracy. But it’s not clear whether this applies more to pretraining or to RL. Note also that in both cases the effect can potentially be alleviated by a curriculum.
Longer episodes: assuming RL just has a single binary reward at the end of each episode, information density decreases as episodes get longer. Since harder tasks require longer chains of thought, this one seems to clearly count against RL (see the sketch below).
Overfitting: if there is a mismatch between the training distribution used for RL and the distribution used to benchmark the model, one might expect the density of information relevant to the benchmark to decrease as the model overfits to the training distribution. I think this one also counts against RL right now, but can be alleviated by improving data quality and quantity.
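Here is a minimal sketch of the first two effects, assuming a single binary reward per episode and treating each outcome as an independent coin flip (the accuracies and episode lengths are illustrative, not taken from any particular run):

```python
import math

def bernoulli_entropy_bits(p):
    """Bits of information carried by one pass/fail outcome with success rate p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Illustrative accuracies and episode lengths: bits per generated token provided
# by a single end-of-episode binary reward.
for accuracy in (0.5, 0.9, 0.99):
    for episode_tokens in (500, 9000):
        bits_per_token = bernoulli_entropy_bits(accuracy) / episode_tokens
        print(f"accuracy={accuracy:.2f}  tokens={episode_tokens:5d}  "
              f"bits/token={bits_per_token:.1e}")
```

With these assumptions, going from 50% accuracy on 500-token episodes to 99% accuracy on 9000-token episodes cuts the information per token by over two orders of magnitude.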
In particular, I think the fact that overfitting can be mitigated with better data cuts against your empirical observations. Since, as you correctly note, RL compute started from a very small base, it was initially much cheaper to scale up compute than to scale up data. But as RL compute becomes more expensive, it will become comparatively more cost-effective to scale up data. Once spending on both is being scaled up at a similar rate (as is economically inevitable as long as spending continues to increase), we should expect to see some regression towards the pretraining slope in my opinion.
Overall, I think the effect you spotted is real (due to things like episode length), but ultimately won’t turn out to be as extreme as you estimated here. Quantitatively, I would guess that RL will look more like a power of 1.5-2 worse than pretraining rather than a power of 3 worse, and there could be certain training regimes (e.g. fixed episode length) where they are closer than that.
Thanks Jacob. It is less of a mathematical mistake and more me trying to make a qualitative connection between the observed poor scaling of RL training and the theoretical mechanism of poor information efficiency I’d just written about, both of which look very big. I agree that the theoretical explanation doesn’t seem to be quite the right shape to explain the empirical issue.
Of your potential reasons, I do think longer episodes is part of it. The R1 paper has a chart on page 8 showing that, without trying to affect episode lengths, they increased linearly from 500 tokens to ~9000 tokens over 8000 episodes, suggesting pretty much a 1-token increase per episode on average. Thus the information efficiency was going down inversely with the number of episodes during training. It is a bit tricky to compare this with the o1 chart, whose x-axis is logarithmic and measures training compute rather than episode number. I think this means it should be declining as 1/episodes = 1/root(compute), since training compute would be growing as the square of the number of episodes. And I think that just accounts for the power being 0.5 lower than pretraining, rather than the 3 that I’m claiming. (But it’s late, and I haven’t checked this through on paper.)
I do think you are right that the information inefficiency can’t explain the whole issue, but it might be able to explain part of it. I.e. a shift to the left can’t explain a line with a different slope, but a change in slope that goes part of the way could be part of the explanation.
Actually, here is a slightly simpler way to think about it. How many more training steps do you do with RL when you 100x the compute? Given the linear episode-length growth, you only do root(100) = 10x the number of training steps. So if capability gain were linear in the log of the number of training steps, it would grow as log(root(compute)) = log(compute)/2, whereas for pretraining it would grow as log(compute). So if inference-scaling were going as well as pre-training scaling (contra the 3/2 estimate I appealed to in my piece), then the theoretical explanation from information inefficiency could exactly account for the observed scaling behaviour.
I’m not sure this is right (there were a couple of biggish assumptions there), but it does feel closer to being able to account for a larger part of the actual explanation.
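As a quick numeric check of both calculations above, using the figures quoted from the R1 chart and assuming per-step compute is roughly proportional to episode length (itself an assumption):

```python
import math

# Approximate figures quoted above from the R1 chart.
start_len, end_len = 500, 9000   # tokens per episode
steps = 8000                     # training steps

print((end_len - start_len) / steps)   # ~1.06 tokens of growth per step

# If episode length grows linearly with step number, and per-step compute is
# roughly proportional to episode length, cumulative compute grows roughly as
# steps**2, so the number of steps grows as root(compute).
print(math.sqrt(100))   # 10.0: 100x the compute buys only 10x the steps

# And if capability gain is linear in log(steps), the gain per order of
# magnitude of compute is halved: log(root(C)) = log(C) / 2.
```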
Nice observation, and I agree with your calculation that linear episode length growth would account for a worse scaling exponent by a factor of 2 (or more generally, episode length growing with exponent k would account for a worse scaling exponent by a factor of k+1).
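A small numerical check of the k+1 generalisation, under the same assumption that per-step compute scales with episode length (the exponent is fitted from a simulated run, so this is a sanity check rather than a derivation):

```python
import numpy as np

def fitted_exponent(k, n_steps=100_000):
    """Fit b in steps ~ compute**b when episode length grows as step**k."""
    steps = np.arange(1, n_steps + 1, dtype=float)
    episode_len = steps ** k            # episode length at each step (arbitrary units)
    compute = np.cumsum(episode_len)    # cumulative compute, assuming cost ~ length
    b, _ = np.polyfit(np.log(compute), np.log(steps), 1)
    return b

for k in (0, 1, 2):
    print(f"k={k}: steps ~ compute^{fitted_exponent(k):.3f} (expected 1/{k + 1})")
```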
Note also that this suggests a potential remedy, namely controlling episode length, but there is less incentive to apply this when data is more of a constraint than compute.
In the long run, if the contribution to model quality from RL scales more slowly than that from pretraining, and both are used at a similar scale, that just means RL doesn’t improve the overall speed of scaling (of model quality with compute) compared to pretraining-only scaling, and it wouldn’t matter how much slower RL scaling is. But pretraining might also face a scaling ceiling as training data runs out, while RL likely won’t, in which case slower scaling of RL predicts slower scaling overall compared to pretraining-only scaling, once pretraining can no longer be usefully scaled.
I would guess that RL will look more like a power of 1.5-2 worse than pretraining rather than a power of 3 worse
There’s some compute optimal ratio of pretraining compute to RL compute (describing the tradeoff within a fixed budget of total compute or GPU-time), which depends on the amount of total compute. If usefulness of RL and pretraining scale differently, then that ratio will tend either up or down without bound (so that you’d want almost all compute to go to pretraining, or almost all compute to go to RL, if you have enough compute to extremize the ratio).
What matters in practice is then where that ratio sits in the near future (at 1e26-1e29 FLOPs of total compute). Also, there’s going to be some lower bound: at least 10-30% will always be spent on each, as long as both remain scalable and deliver that much value in some way, because they do different things and each will always have an outsized impact on some aspects of the resulting models. In particular, RL enables training in task-specific RL environments, giving models competence in things they just can’t learn from pretraining (on natural data), so there’s going to be a growing collection of RL environments that teach models more and more skills, which in practice might end up consuming the majority of the compute budget.
So even if for capabilities usefully trainable with both pretraining and RL it turns out that allocating 5% to RL is compute optimal at 1e28 FLOPs, in practice 70% of compute (or GPU-time) might still go to RL, because the capabilities that are only trainable with RL end up being more important than doing a bit better on the capabilities trainable with either (by navigating the compute optimal tradeoff between the two). Also, natural text data for pretraining is running out (at around 1e27-1e28 FLOPs), while RL is likely to remain capable of making use of more compute, which also counts towards allocating more compute for RL training.
Yes, you would get an optimal allocation with non-zero amounts to each. A simple calculation suggests a 1:2 ratio of RL-OOMs to inference-OOMs, e.g. scaling up RL by 100x and inference by 10,000x. So it could easily lead to RL compute becoming an ever-smaller fraction of FLOPs. But there are additional complications from the fact that inference is a flow of costs that also increases with the number of users, while RL is a fixed cost.
On the simple model and with my scaling numbers, the contribution of RL to capabilities (keeping token-use fixed) would be 20% — a 1:4 ratio with inference because half as many OOMs and half the effect per OOM.
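To spell out the arithmetic behind those two figures (the only input beyond the example allocation is the “half the effect per OOM” assumption stated above):

```python
import math

# Example allocation from above: scale RL by 100x and inference by 10,000x.
rl_ooms = math.log10(100)            # 2 orders of magnitude
inference_ooms = math.log10(10_000)  # 4 orders of magnitude
print(rl_ooms / inference_ooms)      # 0.5, i.e. the 1:2 ratio of RL-OOMs to inference-OOMs

# Assumption from the comment above: each RL OOM is worth about half as much
# as an inference OOM (holding token-use fixed).
rl_gain = rl_ooms * 0.5                # 1 unit
inference_gain = inference_ooms * 1.0  # 4 units
print(f"RL share of the combined gain: {rl_gain / (rl_gain + inference_gain):.0%}")  # 20%
```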
The main relevance of all this to me is that even if people keep doing RL, RL alone won’t contribute much to benchmark performance. I think it would need to 100,000x current total training compute to gain the equivalent of just 100x on pretraining in the early years. So if pre-training is slowing, AI companies lack any current method of effective compute scaling based solely around training compute and one-off costs.
RL can develop particular skills, and given that IMO has fallen this year, it’s unclear that further general capability improvement is essential at this point. If RL can help cobble together enough specialized skills to enable automated adaptation (where the AI itself will become able to prepare datasets or RL environments etc. for specific jobs or sources of tasks), that might be enough. If RL enables longer contexts that can serve the role of continual learning, that also might be enough. Currently, there is a lot of low hanging fruit, and little things continue to stack.
So if pre-training is slowing, AI companies lack any current method of effective compute scaling based solely around training compute and one-off costs.
It’s compute that’s slowing, not specifically pre-training, because the financing/industry can’t scale much longer. The costs of training were increasing about 6x every 2 years, resulting in a 12x increase in training compute every 2 years over 2022-2026. Possibly another 2x on top of that every 2 years from the adoption of reduced floating-point precision in training, going from BF16 to FP8 and soon possibly to NVFP4 (it likely won’t go any further). A 1 GW system of 2026 costs an AI company about $10bn a year. There are maybe 2-3 more years at this pace in principle, but more likely the slowdown will start gradually sooner, and then it’s Moore’s law (of price-performance) again, to the extent that it’s still real (which is somewhat unclear).
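A back-of-envelope reconciliation of those growth figures; the ~2x-per-2-years price-performance factor is inferred from the 6x-spend and 12x-compute numbers rather than stated above:

```python
# Two-year multipliers from the comment above.
spend_growth = 6        # training costs growing ~6x every 2 years
compute_growth = 12     # stated training-compute growth every 2 years

print(compute_growth / spend_growth)     # 2.0x per 2 years of implied hardware price-performance

precision_bonus = 2     # possible extra ~2x per 2 years from BF16 -> FP8 -> NVFP4
print(compute_growth * precision_bonus)  # ~24x per 2 years while the precision gains last
```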
I’m getting somewhat confused about information-theoretic arguments around RL scaling. What makes sense to me is that: information density is constant per token in pre-training, no matter how long you make contexts, but decreases as 1/n as you make RL trajectories longer. This means that if you look just at scaling context length, RL should get asymptotically less efficient.
What’s not clear to me is the relationship between “bits getting into the weights” and capabilities. Using the information-theoretic argument above, you’d probably get that in o3, one millionth of the information in the weights comes from RL, or something like that, I’m not sure. But o3’s advance in capabilities over 4o seems clearly far larger than a one-millionth-sized improvement. I think this would be true even if you work to disentangle inference-time scaling and RL scaling, e.g. the ratio of bits in o1 vs o3: the number of additional bits in o3 over o1 is very small, but, thinking for the same time, the difference is very noticeable.
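For what it’s worth, here is the kind of back-of-envelope that lands in that “one millionth” ballpark; every number in it is a made-up illustrative order of magnitude, not a figure about any actual model:

```python
# Purely illustrative orders of magnitude (not real figures for 4o, o1 or o3).
pretraining_tokens = 1e13    # assumed pretraining token count
bits_per_token = 1.0         # rough upper bound on usable bits per next-token target
rl_episodes = 1e7            # assumed number of RL episodes
bits_per_episode = 1.0       # single binary reward per episode

pretraining_bits = pretraining_tokens * bits_per_token
rl_bits = rl_episodes * bits_per_episode
print(f"RL share of total bits: {rl_bits / (pretraining_bits + rl_bits):.0e}")  # ~1e-06
```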