The clues about the compute-optimal pretraining-to-RL ratio are interesting. RLVR still has a long way to go with even longer reasoning traces (sequential inference scaling): currently traces are still mostly 20K-50K tokens, while 1M-token contexts are supported even by current models, and the suddenly increasing HBM capacity of scale-up worlds for newer hardware[1] will soon enable much longer contexts, even for larger models. Contexts this long don’t really work after pretraining alone, but RLVR might be able to make them work.
An AGI-relevant implication of longer contexts (as opposed to pretraining or RL scaling) is that continual learning (the crucial capability of automated adaptation) currently only really works within a context, as an aspect of in-context learning. If contexts were to increase to tens of millions of tokens, they could in principle hold the equivalent of years of experience (3 years at 40K tokens per day is about 44M tokens). If scaling and RLVR make such long contexts do a good enough job of implementing continual learning, algorithmic innovations that go about it in a more reasonable way wouldn’t be necessary to overcome this obstruction.
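A quick check of the arithmetic behind that figure (the 40K tokens/day rate is just the rough assumption from above):

```python
# Token-equivalent of "years of experience" at an assumed rate of 40K tokens/day.
tokens_per_day = 40_000
days_per_year = 365

for years in (1, 3, 10):
    total = years * days_per_year * tokens_per_day
    print(f"{years:>2} year(s) ~ {total / 1e6:.0f}M tokens")
# ->  1 year(s) ~ 15M tokens
# ->  3 year(s) ~ 44M tokens
# -> 10 year(s) ~ 146M tokens
```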
The change is from about 1 TB of HBM for 8-chip Nvidia servers to 14-20 TB for GB200/GB300 NVL72 racks coming online this year, and there’s also 50 TB for a TPUv7 Ironwood pod. Thus in 2026-2027 it suddenly becomes possible to serve much larger models than in 2023-2025, as a step change rather than gradually. For the same reason, the models will be able to work with much longer contexts, at least in principle.
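To make the context implication concrete, here is a rough KV-cache sizing sketch; the model shape below is a made-up example rather than any particular model, and real deployments change the numbers a lot:

```python
# Back-of-the-envelope KV-cache memory per request for a hypothetical dense model.
# All shape parameters below are illustrative assumptions, not any specific model's config.
def kv_cache_gb(context_tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    # 2x for keys and values; one cached vector per layer per KV head per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 1e9

for ctx in (128_000, 1_000_000, 10_000_000):
    print(f"{ctx:>10,} tokens -> ~{kv_cache_gb(ctx):,.0f} GB of KV cache per request")
# ->    128,000 tokens -> ~42 GB
# ->  1,000,000 tokens -> ~328 GB
# -> 10,000,000 tokens -> ~3,277 GB
```

Under these assumptions a single multi-million-token request needs a terabyte-scale KV cache on top of the weights, which is awkward in a ~1 TB server but comfortable in a 14-50 TB scale-up world (cache quantization and cache-compressing attention variants shrink these numbers, but the ordering stands).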
Except that the METR evaluation gave GPT-5 a budget of more than a hundred dollars per run and, as far as I understand, a context window of 400K tokens. GPT-5 is priced at $10 per million output tokens.
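For scale, the output-token side of that budget alone (ignoring input tokens, which are billed separately) comes to roughly:

```python
# Output tokens a per-run budget buys at GPT-5's $10 per million output tokens,
# ignoring input-token costs (billed separately).
budget_usd = 100               # "more than a hundred dollars per run"
usd_per_million_output = 10

max_output_tokens = budget_usd / usd_per_million_output * 1_000_000
print(f"${budget_usd} -> up to {max_output_tokens / 1e6:.0f}M output tokens")  # -> 10M
```

If fully spent on outputs, that is far more than fits in a single 400K context.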
As for big contexts and their utility, I tried to cover that in my analysis[1] of the ARC-AGI benchmark in the high-cost regime, where Claude Sonnet 4.5, Grok 4, and GPT-5 Pro form another straight line with a shallow slope.

However, that analysis could have been invalidated by the appearance of Grok 4 Fast on the ARC-AGI-1 benchmark: while Grok 4 Fast is a low-cost model, it significantly surpassed the previous Pareto frontier.
I’m distinguishing sequential inference scaling (the length of a single reasoning trace, specifically its output tokens) from parallel scaling (or, more generally, agentic multi-trace interaction processes). GPT-5 Pro and such work with more tokens per request by running things in parallel; agentic scaffolds generate more tokens by spinning up new instances for subtasks as the situation develops; and often there are many more input tokens in a context than output tokens generated within a single reasoning trace.
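A toy accounting of the distinction (the trace counts and token numbers are made up purely for illustration; nothing here reflects how GPT-5 Pro actually fans out requests):

```python
# Toy token accounting: the sequential depth of one reasoning trace vs the
# total tokens consumed by a parallel/agentic request. Numbers are illustrative.
from dataclasses import dataclass

@dataclass
class Trace:
    input_tokens: int   # prompt + accumulated context fed to the model
    output_tokens: int  # tokens generated within this single reasoning trace

# One request that fans out into parallel traces plus a spawned subtask instance.
traces = [
    Trace(input_tokens=120_000, output_tokens=30_000),  # main trace
    Trace(input_tokens=120_000, output_tokens=25_000),  # parallel sample
    Trace(input_tokens=120_000, output_tokens=35_000),  # parallel sample
    Trace(input_tokens=60_000, output_tokens=20_000),   # subtask instance
]

sequential_depth = max(t.output_tokens for t in traces)               # what sequential scaling measures
total_tokens = sum(t.input_tokens + t.output_tokens for t in traces)  # what the bill measures

print(f"longest single trace: {sequential_depth:,} output tokens")  # 35,000
print(f"total tokens across the request: {total_tokens:,}")         # 530,000
```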