After 2029-2031, there are new things that could be attempted, such as next word prediction RLVR, and enough time will have passed that new ideas might be ready, so I'm only talking about the very near term.
As an aside, next word prediction RLVR has always struck me as a strange idea. If we’d like to improve performance on next token prediction, we know how to do that directly via scaling laws. That is, I’d be surprised if having the model think about the next token in a discrete space for N steps would beat making the model N times larger and letting it think in continuous space, since in the former case most of the computation of each forward pass is wasted. There are also practical difficulties; e.g., it’s bottlenecked on memory bandwidth and is harder to parallelize.
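To put rough numbers on that intuition (a back-of-envelope estimate of mine, using the standard approximation that a forward pass of a dense $P$-parameter model costs about $2P$ FLOPs per token), the two options spend comparable compute per predicted token:

$$
\underbrace{N \cdot 2P}_{N \text{ sampled steps, } P \text{ params}} \;\approx\; \underbrace{2 \cdot (NP)}_{\text{one pass, } NP \text{ params}} \;=\; 2NP \ \text{FLOPs}.
$$

But in the discrete case the $N$ passes run sequentially, and each step’s activations are collapsed down to a single sampled token before the next step begins, which is the sense in which most of each forward pass is thrown away.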
Next word prediction RLVR is massively more compute hungry per unit of data than pretraining, so it’s both likely impractical at current levels of compute and plausibly a solution to text data scarcity at 2028-2030 levels of compute, if it’s useful at all. The benefit is generality of the objective, the same as with pretraining itself, compared to manual construction of RL environments for narrow tasks. Given the pretraining vs. RLVR capability gap, it’s plausibly a big deal if it makes RL-level capabilities as general as the current shallow pretraining-level capabilities.
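For a rough sense of scale (again a back-of-envelope estimate, with the rollout length $L$ as a free parameter): pretraining spends roughly $6P$ FLOPs per training token, while RLVR that generates an $L$-token rollout for each target token spends on the order of $L$ times that,

$$
\underbrace{\approx 6P}_{\text{pretraining, per token of data}} \quad \text{vs.} \quad \underbrace{\approx L \cdot 6P}_{\text{RLVR, } L\text{-token rollout per token of data}}.
$$

With $L$ in the hundreds to thousands (times however many rollouts get sampled per prompt), each token of text costs two to three orders of magnitude more compute, which is the kind of multiplier that looks prohibitive now but stops being prohibitive after a few more years of compute scaling, while stretching a fixed text corpus correspondingly further.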
The fact that Łukasz Kaiser (transformer paper co-author, currently at OpenAI) is talking about it in Nov 2025 is strong evidence that AI companies haven’t yet been able to rule out that it might work. The idea itself is obvious enough, but that’s less significant as evidence for its prospects.
I agree that it requires a lot of compute, but I think that misunderstands the objection. My claim is that for any level of compute, scaling parameters or training epochs using existing pretraining recipes will be more compute-efficient than RLVR for the task of next token prediction. One reason is that by scaling models you can directly optimize the cross-entropy objective through gradient descent, whereas with RLVR you have to sample intermediate tokens, and optimizing over these discrete tokens is difficult and inefficient. That being said, I could imagine there being some other objective besides next token prediction for which RLVR could have an advantage over pretraining. This is what I imagine Łukasz is working on.
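To make the contrast concrete, here is a minimal PyTorch-style sketch (my own toy construction; `model`, `n_think`, and the reward definition are illustrative assumptions, not anyone’s actual setup). It compares a pretraining step, where the cross-entropy loss is differentiated directly, with an RLVR step on next-token prediction, where discrete thinking tokens are sampled and the binary reward reaches the parameters only through a REINFORCE-style estimator:

```python
import torch
import torch.nn.functional as F

# Schematic sketch only: `model` stands for any autoregressive LM mapping a
# (batch, seq) tensor of token ids to (batch, seq, vocab) logits.

def pretraining_step(model, tokens):
    """One supervised step: the cross-entropy objective is differentiated directly."""
    logits = model(tokens[:, :-1])                      # (batch, seq-1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    loss.backward()                                     # exact gradient, every position supervised
    return loss

def rlvr_step(model, context, next_token, n_think=64):
    """One RLVR step on next-token prediction: sample discrete 'thinking' tokens,
    then an answer token; the 0/1 verifiable reward only reaches the parameters
    through a high-variance REINFORCE-style estimator."""
    tokens = context
    logprob_sum = torch.zeros(context.size(0), device=context.device)
    for _ in range(n_think + 1):                        # n_think thoughts + 1 answer token
        logits = model(tokens)[:, -1]                   # one forward pass per sampled token
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()                             # discrete, non-differentiable
        logprob_sum = logprob_sum + dist.log_prob(tok)
        tokens = torch.cat([tokens, tok[:, None]], dim=1)
    reward = (tok == next_token).float()                # did the final token match the data?
    loss = -(reward * logprob_sum).mean()               # policy gradient, no baseline
    loss.backward()
    return loss
```

Each `rlvr_step` above spends `n_think + 1` serial forward passes (a real implementation would at least reuse a KV cache) to obtain one bit of binary reward about a single token of data, while `pretraining_step` gets an exact gradient for every position in the sequence from one forward-backward pass. That asymmetry is the inefficiency I have in mind.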