The point is the distinction between pretraining and RL (in level of capability), and between manual, jagged RLVR and hypothetical general RL (in generality of capability). I think observing Opus 4.5 and Gemini 3 Pro is sufficient to be somewhat confident that, even at 2026 compute levels, pretraining by itself won't be sufficient for AGI (it won't train sufficiently competent in-context learning behavior to let AIs work around all their hobblings), while the IMO gold-medal results (even at DeepSeek-V3 model size) demonstrate that RLVR is strong enough to produce superhuman capabilities in the narrow skills it gets applied to (especially with 1-4T active-param models rather than DeepSeek-V3). So in the current regime (until 2029-2031, when yet another level of compute becomes available), AGI requires some kind of general RL, and continual learning doesn't necessarily enable it on its own, even if it becomes very useful for on-the-job training of AI instances.
This is more a claim that timelines don't get shorter within the 2026-2028 window because of continual learning, even if it's understood as something that significantly increases AI adoption and secures funding for 5+ GW training systems by 2028-2030 (as well as rousing the public via job displacement). That is, starting with timelines in which continual learning doesn't appear (as its own thing, rather than as an aspect of AGI) and AGI doesn't arrive in 2026-2028, I think adding continual learning on its own doesn't obviously give AGI either, assuming continual learning is not actual automated RLVR (AIs applying RLVR automatically to add new skills to themselves). After 2029-2031, there are new things that could be attempted, such as next-word-prediction RLVR, and enough time will pass that new ideas might get ready, so I'm only talking about the very near term.
As an aside, next-word-prediction RLVR has always struck me as a strange idea. If we'd like to improve at the task of next-token prediction, we already know how to do that directly, via scaling laws. That is, I'd be surprised if having the model think about the next token in a discrete space for N steps would beat making the model N times larger and letting it think in continuous space, since in the former case most of the computation of the forward passes is wasted. There are also practical difficulties: sampling those intermediate tokens sequentially is bottlenecked on memory bandwidth and is harder to parallelize.
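To make the aside concrete, here is a toy FLOPs comparison under the standard dense-transformer approximation (~2P FLOPs per forward-pass token for a model with P parameters); the specific sizes and step count are hypothetical, chosen only to show the parity:

```python
# Rough FLOPs arithmetic (assumed approximation: ~2*P FLOPs per
# forward-pass token for a dense model with P parameters).
P = 100e9   # hypothetical base model: 100B params
N = 10      # reasoning steps per predicted token / size multiplier

# Option A: the base model "thinks" N discrete tokens before predicting.
flops_discrete = N * 2 * P      # N sequential decode steps

# Option B: a model N times larger predicts directly.
flops_larger = 2 * (N * P)      # one forward pass of the big model

# Per predicted token, the raw FLOPs come out identical...
assert flops_discrete == flops_larger
# ...but in Option A the intermediate steps produce tokens that are
# discarded, and each step is a sequential, memory-bandwidth-bound
# decode, while Option B spends the same FLOPs in one parallel pass.
print(flops_discrete / 1e12, "TFLOPs per predicted token either way")
```

The point of the sketch is that the discrete-thinking route doesn't buy any compute advantage over simply scaling up; it only changes how the same compute is spent, and in a way that wastes most of it.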
Next-word-prediction RLVR is massively more compute-hungry per unit of data than pretraining, so it's likely impractical at current levels of compute, yet plausibly solves text-data scarcity at 2028-2030 levels of compute if it works. The benefit is the generality of the objective, the same as with pretraining itself, compared to manual construction of RL environments for narrow tasks. Given the pretraining-vs.-RLVR capability gap, it's plausibly a big deal if it makes RL-level capabilities as general as the current shallow pretraining-level capabilities.
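A rough sketch of the "compute-hungry per unit of data" point, with entirely hypothetical numbers for model size, reasoning length, and rollout count:

```python
# Hypothetical cost-per-data-token comparison.
P = 1e12          # assumed 1T active params
k = 1000          # assumed reasoning tokens sampled per supervised token
rollouts = 8      # assumed rollouts per token for the RL estimator

pretrain_flops_per_token = 6 * P             # fwd+bwd rule of thumb
rlvr_flops_per_token = rollouts * k * 2 * P  # sampling dominates the cost

ratio = rlvr_flops_per_token / pretrain_flops_per_token
print(f"RLVR costs ~{ratio:.0f}x more compute per token of text")
# At fixed compute this covers ~1/ratio as much text, which only becomes
# acceptable once data, rather than compute, is the binding constraint.
```

With these made-up numbers the multiplier is in the thousands, which is why the same technique can be impractical now and attractive once compute grows past the point where pretraining has exhausted the text data.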
The fact that Łukasz Kaiser (a co-author of the transformer paper, currently at OpenAI) was talking about it in Nov 2025 is strong evidence that AI companies couldn't yet rule out that it might work. The idea itself is obvious enough, but that's less significant as evidence about its prospects.
I agree that it requires a lot of compute, but I think that reply misunderstands the objection. My claim is that, at any level of compute, scaling parameters or training epochs with existing pretraining recipes will be more compute-efficient than RLVR for the task of next-token prediction. One reason is that by scaling models you can optimize the cross-entropy objective directly through gradient descent, whereas with RLVR you have to sample intermediate tokens, and optimizing over those discrete tokens is difficult and inefficient. That said, I can imagine there being some other objective besides next-token prediction for which RLVR could have an advantage over pretraining; this is what I imagine Łukasz is working on.
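A toy illustration of this contrast, using a softmax over a tiny made-up vocabulary: the cross-entropy gradient comes in closed form from one forward pass, while a REINFORCE-style estimator built from sampled tokens is noisy and (in this setup) scaled down by the probability of the correct token:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5                        # toy vocabulary size
logits = rng.normal(size=V)  # stand-in for a model's next-token logits
target = 2                   # ground-truth next token

probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Pretraining: exact gradient of cross-entropy loss w.r.t. logits,
# available in closed form from a single forward pass.
grad_ce = probs.copy()
grad_ce[target] -= 1.0

# RLVR-style: REINFORCE estimate from sampled tokens, with reward 1
# if the sample matches the target and 0 otherwise.
samples = rng.choice(V, size=10_000, p=probs)
grad_est = np.zeros(V)
for s in samples:
    reward = 1.0 if s == target else 0.0
    one_hot = np.zeros(V)
    one_hot[s] = 1.0
    grad_est += reward * (probs - one_hot)  # grad of -reward * log p(s)
grad_est /= len(samples)

# In expectation grad_est equals p(target) * grad_ce: the sampled
# estimator recovers the same direction, but noisily, scaled down, and
# only after many rollouts, whereas pretraining gets it in one pass.
print(np.round(grad_ce, 3))
print(np.round(grad_est / probs[target], 3))
```

Real RLVR setups are more elaborate (verifiers, baselines, long reasoning chains), but the core inefficiency sketched here, paying many sampled rollouts for a noisy estimate of a gradient that cross entropy yields exactly, is the substance of the objection.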