The only way LLMs learn deep skills is with RLVR (reinforcement learning with verifiable rewards), and RLVR is not automated for novel skills. There is room for about a billion RLVR tasks with 2026-2027 compute, and largely automating their creation would make LLMs more cognitively self-sufficient. This milestone might even be called “prosaic RSI” (recursive self-improvement), even though it’s not true RSI that develops fundamentally new ways of implementing cognition.
One gigawatt of compute costs about $12bn per year, so 3 months will cost $3bn. For a large model, a million output tokens might cost the API provider $10, so 3 months on 1 GW of compute produces 300T output tokens. An RLVR task might run for 20K tokens and be attempted 16 times, to produce outcomes with different levels of success in completing the task. This is 320K tokens per task, or about 300K, so the 300T tokens suffice for about a billion tasks.
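The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the constants are the figures quoted in the paragraph, and the variable names are mine.

```python
# Back-of-envelope check of the "about a billion tasks" estimate.
# All constants are the rough figures quoted in the text, not measurements.
GW_COST_PER_YEAR_USD = 12e9         # cost of 1 GW of compute for a year
MONTHS = 3                          # duration of the RLVR run
USD_PER_MILLION_OUTPUT_TOKENS = 10  # provider-side cost for a large model
TOKENS_PER_ATTEMPT = 20_000         # length of one attempt at a task
ATTEMPTS_PER_TASK = 16              # attempts per task, to get varied outcomes

budget_usd = GW_COST_PER_YEAR_USD * MONTHS / 12
total_tokens = budget_usd / USD_PER_MILLION_OUTPUT_TOKENS * 1e6
tokens_per_task = TOKENS_PER_ATTEMPT * ATTEMPTS_PER_TASK
tasks = total_tokens / tokens_per_task

print(f"budget:          ${budget_usd / 1e9:.1f}bn")
print(f"output tokens:   {total_tokens / 1e12:.0f}T")
print(f"tokens per task: {tokens_per_task / 1e3:.0f}K")
print(f"tasks:           {tasks / 1e9:.2f} billion")
```

With the unrounded 320K tokens per task this comes out slightly under a billion, which is well within the precision of the inputs.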
Currently, there wouldn’t be this many unique RLVR tasks, but it would probably be useful if there were, and that can’t be done manually. Using many more tasks (and a general method for choosing them) should also help with the jaggedness of RLVR-trained capabilities.
In contrast, continual learning would in the near term likely only perform at the level of in-context learning, which is incapable of things like learning to play chess at the level of the most capable humans. And automating RLVR should be much easier in the context where it’s currently practiced, within AI companies as they are training the next version of a model. Automating RLVR so fully that it can be used post-deployment would be much more difficult, and also issues with sample efficiency would be more relevant there. So some gesturing at “RSI” in the very near term (or at making a feature-complete “AGI” that learns “on its own”) might be about significantly greater (even if not full) automation of RLVR task creation during model training.
I think “automation of RLVR task creation during model training” is much the same as “self play and autocurricula” here. I agree with calling this “prosaic RSI” because it could lead to something that is superhuman on a variety of tasks, even though it doesn’t improve (the generality of) the underlying learning algorithm (like architecture, optimizer or objective function).
Automating RLVR so fully that it can be used post-deployment would be much more difficult, and also issues with sample efficiency would be more relevant there.
Yes, and another problem with continual (test time) learning is that it likely doesn’t mesh well with the centralized nature of cloud-based AI services where millions of people use the same weights.
Training that would normally want to use full weights usually works almost as well with LoRA when you don’t need too many training steps. As with KV-cache, you can maintain separate LoRA weights for each request/user. It’s the next level of difficulty to get task-specific training to work over so many steps that LoRA (as opposed to full weights) starts becoming a problem. Recurrent state could also probably play this role, and doesn’t need to be nearly as large as the main model, the same as with KV-cache and LoRA weights.
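To make the KV-cache analogy concrete, here is a minimal NumPy sketch of per-user LoRA state over one shared, frozen weight matrix. Everything in it is illustrative: a single linear layer, a squared-error objective, and names like `adapters` and `sgd_step` are my assumptions, not how any provider actually serves models. It also omits the usual alpha/rank scaling factor.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT, RANK = 64, 64, 4

# Shared base weights: one copy for all users, never updated here.
W = rng.standard_normal((D_IN, D_OUT)) / np.sqrt(D_IN)

# Per-user low-rank state, kept separately the way a KV-cache would be.
adapters = {}

def get_adapter(user_id):
    """Create a fresh adapter on first use; B starts at zero so the delta is zero."""
    if user_id not in adapters:
        adapters[user_id] = {
            "A": rng.standard_normal((D_IN, RANK)) * 0.01,
            "B": np.zeros((RANK, D_OUT)),
        }
    return adapters[user_id]

def forward(x, user_id):
    """Base output plus this user's low-rank correction: x W + (x A) B."""
    ad = get_adapter(user_id)
    return x @ W + (x @ ad["A"]) @ ad["B"]

def sgd_step(x, target, user_id, lr=0.1):
    """One gradient step on mean squared error, touching only this user's A and B."""
    ad = get_adapter(user_id)
    h = x @ ad["A"]                      # (batch, RANK)
    err = x @ W + h @ ad["B"] - target   # (batch, D_OUT)
    grad_B = h.T @ err / len(x)
    grad_A = x.T @ (err @ ad["B"].T) / len(x)
    ad["A"] -= lr * grad_A
    ad["B"] -= lr * grad_B

x = rng.standard_normal((8, D_IN))
target = np.zeros((8, D_OUT))
W_before = W.copy()

sgd_step(x, target, "alice")   # only alice's adapter changes
out_bob = forward(x, "bob")    # bob still sees the plain base model
```

The per-user state here is 2 x 64 x 4 numbers against 64 x 64 for the base layer, which is the sense in which it can be stored per request.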
In contrast, continual learning would in the near term likely only perform at the level of in-context learning, which is incapable of things like learning to play chess at the level of the most capable humans.
I think this particular task just takes a lot of compute to learn, relative to other RLVR tasks. This is especially true if you are training with RLVR and not MCTS self-play methods, which are much more sample efficient.
The thing that can’t learn to play good chess is in-context learning; transporting this via the analogy with continual learning (in the forms it’s more likely to take soon) is an additional step in the argument. My guess is that the “continual learning” currently rumored to be getting closer to production is more about unbounded context than about updating model weights, mostly preserving the quality of in-context learning rather than fundamentally improving on it.
Continual learning as unbounded context might introduce true (block-level) recurrence of some kind (as opposed to only KV-cache, where at each step the computation must move up the layers), which also makes treating the recurrent state as model parameters a more natural framing. As blocks of context are processed, the recurrent state is updated, and so in-context learning updates the parameters of the recurrent state in the same way that pretraining updates the parameters of the main model. LoRA parameters over the main model also fit this description (as recurrent state), if they change as the context is processed. Modern hardware could probably support recurrent states with 85 GB of parameters for $10 per million tokens[1] (in additional cost, individually for each request), so these recurrent states could get to the size of the smaller models, capable of holding arbitrary skills.
But there is no method to train the updating of the recurrent states (on additional context) so that they would work as well as RLVR-trained models. In principle, this architecture might enable in-context learning to be capable of something like that, with the recurrent state representing the learned skills, and the main model representing the learning process that creates, based on context, recurrent states holding the skills relevant to that context. But in practice, we only get the kind of in-context learning that’s trained by pretraining, and it’s weaker than pretraining itself at comprehending novel concepts, even though the sample efficiency of in-context learning is much better than that of pretraining. Being already weaker than pretraining, in-context learning is vastly weaker than RLVR in the strength of the skills it instills. It’s not going to be relevant for any AGI thresholds without a better training method.
[1] GB300 HBM bandwidth is 8 TB/s per package/chip. With chips at about $3.4 per hour ($12bn for a 400K chip datacenter per year), $10 buys 10.6e3 seconds. With 10.6e3 chip-seconds, 8 TB/s can read 85 GB a million times (for a million tokens).
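The footnote’s estimate can be reproduced as follows. This is a sketch under the footnote’s own assumptions: the recurrent state is read from HBM once per token, and memory bandwidth is the only cost counted. The constants are the footnote’s figures; the variable names are mine.

```python
# Reproduce the footnote: how large a per-request recurrent state can be
# if $10 per million tokens pays for reading it from HBM once per token.
DATACENTER_COST_PER_YEAR_USD = 12e9
CHIPS = 400_000
HBM_BYTES_PER_SECOND = 8e12    # 8 TB/s per package, as quoted
BUDGET_USD = 10.0              # extra cost allowed per million tokens
TOKENS = 1_000_000
HOURS_PER_YEAR = 365 * 24

chip_usd_per_hour = DATACENTER_COST_PER_YEAR_USD / CHIPS / HOURS_PER_YEAR
chip_seconds = BUDGET_USD / chip_usd_per_hour * 3600
bytes_read = chip_seconds * HBM_BYTES_PER_SECOND
state_bytes = bytes_read / TOKENS  # bytes that can be re-read for every token

print(f"chip cost:    ${chip_usd_per_hour:.2f}/hour")
print(f"chip-seconds: {chip_seconds:.0f}")
print(f"state size:   {state_bytes / 1e9:.0f} GB")
```

Without rounding the hourly price down to $3.4, the result lands slightly below the footnote’s 85 GB, a difference that doesn’t matter at this level of precision.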