In the long run, if the contribution of RL to the quality of the result scales slower than that of pretraining and both are used at a similar scale, that just means that RL doesn’t improve the overall speed of scaling (of model quality with compute) compared to pretraining-only scaling, and it wouldn’t matter how much slower RL scaling is. But also, pretraining might face a scaling ceiling due to training data running out, while RL likely won’t, in which case slower scaling of RL predicts slower scaling overall compared to pretraining-only scaling, once pretraining can no longer be usefully scaled.
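As a minimal sketch of that first point (a toy model of my own, not something from the discussion): if quality is a sum of power laws in pretraining and RL compute, with RL having the smaller exponent, then a fixed 50/50 split ends up improving at the same rate per 10x of compute as pretraining-only scaling, because the slower RL term stops mattering.

```python
# Toy model (my own assumption for illustration): quality is a sum of power laws
# in pretraining and RL compute, with RL having the smaller exponent (0.30 and
# 0.15 are arbitrary illustrative choices).
def quality(c_pre, c_rl, p=0.30, q=0.15):
    return c_pre**p + c_rl**q

print("total compute : quality gain per 10x (50/50 split vs pretraining-only)")
for total in [1e6, 1e9, 1e12, 1e15]:
    gain_split = quality(5 * total, 5 * total) / quality(total / 2, total / 2)
    gain_pre_only = (10 * total) ** 0.30 / total**0.30
    print(f"{total:.0e} : {gain_split:.3f} vs {gain_pre_only:.3f}")
# Both converge to 10**0.30 ≈ 2.0: the slower-scaling RL term stops affecting
# the overall speed of scaling with compute.
```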
I would guess that RL will look more like a power of 1.5-2 worse than pretraining, rather than a power of 3 worse.
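To spell out one way to read “a power of k worse” (my assumption about what the claim means): matching the gain from scaling pretraining by some factor f would require scaling RL compute by roughly f^k.

```python
# Assumed reading of "RL is a power of k worse than pretraining": to match the
# capability gain from multiplying pretraining compute by f, RL compute has to
# be multiplied by roughly f**k.
def rl_multiplier_needed(pretraining_multiplier, k):
    return pretraining_multiplier**k

for k in (1.5, 2.0, 3.0):
    needed = rl_multiplier_needed(100, k)  # matching a 100x pretraining scale-up
    print(f"k = {k}: ~{needed:,.0f}x RL compute to match 100x pretraining")
# k = 1.5 -> ~1,000x, k = 2.0 -> ~10,000x, k = 3.0 -> ~1,000,000x
```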
There’s some compute-optimal ratio of pretraining compute to RL compute (describing the tradeoff within a fixed budget of total compute or GPU-time), which depends on the amount of total compute. If the usefulness of RL and of pretraining scales differently with compute, then that ratio will tend either up or down without bound (so that you’d want almost all compute to go to pretraining, or almost all compute to go to RL, if you have enough compute to extremize the ratio).
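A sketch of why the ratio can drift without bound, using the same additive power-law toy model as above (still my assumption): when the exponents differ, the compute-optimal RL fraction keeps shrinking as the total budget grows (or keeps growing, with the exponents reversed), rather than settling at a fixed split.

```python
import numpy as np

# Same additive power-law toy model as above (an assumption for illustration).
def quality(c_pre, c_rl, p=0.30, q=0.15):
    return c_pre**p + c_rl**q

for total in [1e6, 1e9, 1e12, 1e15]:
    rl_fraction = np.linspace(1e-6, 1 - 1e-6, 100_000)  # share of compute given to RL
    q_total = quality((1 - rl_fraction) * total, rl_fraction * total)
    best = rl_fraction[np.argmax(q_total)]
    print(f"total compute {total:.0e}: compute-optimal RL fraction ≈ {best:.4f}")
# With pretraining scaling faster, the optimal RL fraction keeps falling as the
# budget grows; with the exponents reversed it would keep rising instead, so the
# ratio drifts without bound rather than settling.
```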
What matters in practice is then where that ratio is in the near future (at 1e26-1e29 FLOPs of total compute). Also, there’s going to be some lower bound where at least 10-30% of compute will always be spent on each, as long as they remain scalable and enable that much value in some way, because they are doing different things and one of them will always have an outsized impact on some aspects of the resulting models. In particular, RL enables training in task-specific RL environments, giving models competence in things they just can’t learn from pretraining (on natural data), so there’s going to be a growing collection of RL environments that teach models more and more skills, which in practice might end up consuming the majority of the compute budget.
So even if, for capabilities usefully trainable with both pretraining and RL, it turns out that allocating 5% to RL is compute-optimal at 1e28 FLOPs, in practice 70% of compute (or GPU-time) might still go to RL, because the capabilities that are only trainable with RL end up being more important than doing a bit better on the capabilities trainable with either (by navigating the compute-optimal tradeoff between the two). Also, natural text data for pretraining is running out (at around 1e27-1e28 FLOPs), while RL is likely to remain capable of making use of more compute, which also counts towards allocating more compute to RL training.
Yes, you would get an optimal allocation with non-zero amounts to each. A simple calculation suggests a 1:2 ratio of RL OOMs to inference OOMs, e.g. scaling up RL by 100x and inference by 10,000x. So it could easily lead to RL compute becoming an ever-smaller fraction of total FLOPs. But there are additional complications from the fact that inference is a flow of costs that also increases with the number of users, while RL is a fixed cost.
On the simple model and with my scaling numbers, the contribution of RL to capabilities (keeping token-use fixed) would be 20%, a 1:4 ratio with inference, because RL gets half as many OOMs and half the effect per OOM.
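Spelling out the arithmetic behind the 20% (my reconstruction of the weights in the simple model): RL gets half as many OOMs of scaling as inference, and each RL OOM is assumed to contribute half as much, so the contributions are in a 1:4 ratio.

```python
# Reconstruction of the arithmetic (the per-OOM weights are my assumption):
# contribution ≈ (OOMs of scaling) x (effect per OOM).
rl_ooms, inference_ooms = 2, 4                 # e.g. 100x RL, 10,000x inference (1:2)
rl_effect, inference_effect = 0.5, 1.0         # RL assumed half as effective per OOM

rl_contribution = rl_ooms * rl_effect                       # 1
inference_contribution = inference_ooms * inference_effect  # 4
rl_share = rl_contribution / (rl_contribution + inference_contribution)
print(f"RL : inference = {rl_contribution:.0f} : {inference_contribution:.0f}, "
      f"RL share of the gains ≈ {rl_share:.0%}")            # 1 : 4, ≈ 20%
```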
The main relevance of all this to me is that even if people keep doing RL, RL alone won’t contribute much to benchmark performance. I think it would take something like 100,000x current total training compute to gain the equivalent of just 100x on pretraining in the early years. So if pre-training is slowing, AI companies lack any current method of effective compute scaling based solely around training compute and one-off costs.
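For scale, under the “power of k worse” reading sketched above (an assumption, not necessarily this comment’s model), the 100,000x-for-100x figure corresponds to an exponent of about 2.5, in between the 1.5-2 and 3 figures mentioned earlier.

```python
import math

# Under the assumed "power of k worse" reading (RL needs f**k times the compute
# to match an f-times pretraining scale-up), back out the exponent implied by
# "100,000x training compute to gain the equivalent of 100x on pretraining".
k_implied = math.log(100_000) / math.log(100)
print(f"implied exponent k ≈ {k_implied:.2f}")                     # ≈ 2.5
print(f"check: 100 ** {k_implied:.1f} = {100 ** k_implied:,.0f}")  # 100,000
```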
RL can develop particular skills, and given that the IMO has fallen this year, it’s unclear that further general capability improvement is essential at this point. If RL can help cobble together enough specialized skills to enable automated adaptation (where the AI itself becomes able to prepare datasets or RL environments etc. for specific jobs or sources of tasks), that might be enough. If RL enables longer contexts that can serve the role of continual learning, that also might be enough. Currently, there is a lot of low-hanging fruit, and little things continue to stack.
So if pre-training is slowing, AI companies lack any current method of effective compute scaling based solely around training compute and one-off costs.
It’s compute that’s slowing, not specifically pre-training, because the financing/industry can’t scale much longer. The costs of training were increasing about 6x every 2 years, resulting in a 12x increase in training compute every 2 years in 2022-2026. Possibly another 2x on top of that every 2 years comes from adoption of reduced floating point precision in training, going from BF16 to FP8 and possibly soon to NVFP4 (it likely won’t go any further). A 1 GW system of 2026 costs an AI company about $10bn a year. There are maybe 2-3 more years at this pace in principle, but more likely the slowdown will start gradually sooner, and then it’s Moore’s law (of price-performance) again, to the extent that it’s still real (which is somewhat unclear).
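The implied arithmetic, as I read it (the roughly 2x per 2 years from hardware price-performance is my inference, not stated explicitly above): 6x more spending times about 2x better price-performance gives the 12x compute growth per 2 years, with the precision change adding up to another 2x of effective compute on top.

```python
# Rough decomposition of the 2022-2026 training-compute growth per 2 years.
# The ~2x price-performance factor is my assumption; the 6x spending growth,
# 12x compute growth, and extra 2x from lower precision are from the text above.
spending_growth = 6        # growth in cost of frontier training runs, per 2 years
price_performance = 2      # assumed improvement in FLOP/s per dollar, per 2 years
precision_gain = 2         # BF16 -> FP8 (-> possibly NVFP4), in effective FLOPs

compute_growth = spending_growth * price_performance
print(f"training compute growth per 2 years: ~{compute_growth}x, "
      f"~{compute_growth * precision_gain}x effective with the precision gain")
```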