In contrast, near-term continual learning would likely only perform at the level of in-context learning, which is incapable of things like learning to play chess at the level of the most capable humans.
I think this particular task just takes a lot of compute to learn, relative to other RLVR tasks. This is especially true if you are training with RLVR rather than with MCTS self-play methods, which are much more sample-efficient.
The thing that can’t learn to play good chess is in-context learning; carrying this over to continual learning (in the forms it’s more likely to take soon) by analogy is an additional step in the argument. My guess is that the “continual learning” currently rumored to be getting closer to production is more about unbounded context than about updating model weights, mostly preserving the quality of in-context learning rather than fundamentally improving on it.
Continual learning as unbounded context might introduce true (block-level) recurrence of some kind (as opposed to only the KV-cache, where at each step the computation must move up the layers), which also makes treating the recurrent state as model parameters a more natural framing. As blocks of context are processed, the recurrent state is updated, and so in-context learning updates the parameters of the recurrent state in the same way that pretraining updates the parameters of the main model. LoRA parameters over the main model also fit this description (as recurrent state), if they change as the context is processed. Modern hardware could probably support recurrent states with 85 GB of parameters for $10 per million tokens[1] (in additional cost, paid individually for each request), so these recurrent states could reach the size of smaller full models, capable of holding arbitrary skills.
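A minimal sketch of this framing, with toy sizes and random stand-ins for the learned weights (none of this is a real architecture from the post): the recurrent state is carried across context blocks and rewritten by a learned update rule, so it plays the same role for in-context learning that weight updates play for pretraining.

```python
import numpy as np

rng = np.random.default_rng(0)
D, S = 16, 8  # toy sizes: per-block feature dim, recurrent-state dim

# "Main model": weights fixed at inference time (random stand-ins here).
W_read = rng.normal(size=(S + D, D)) * 0.1   # readout conditioned on state
W_write = rng.normal(size=(S + D, S)) * 0.1  # learned state-update rule

def process_block(state, block):
    """Read from and write to the recurrent state for one context block."""
    x = np.concatenate([state, block])
    output = np.tanh(x @ W_read)      # use the state like extra parameters
    new_state = np.tanh(x @ W_write)  # fold the block into the state, the
    return output, new_state          # way a gradient step folds in a batch

state = np.zeros(S)  # fresh per-request "parameters"
for block in rng.normal(size=(5, D)):  # unbounded context: stream of blocks
    output, state = process_block(state, block)
# `state` now summarizes the context; scaled to ~85 GB it could hold
# arbitrary skills, as the paragraph above suggests.
```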
But there is no method to train the updating of the recurrent states (on additional context) so that they would work as well as RLVR-trained models. In principle, this architecture might enable in-context learning to be capable of something like that, with the recurrent state representing the learned skills, and the main model representing the learning process that, based on context, creates recurrent states holding the skills relevant to that context. But in practice, we only get the kind of in-context learning that’s trained by pretraining, and it’s weaker than pretraining itself at comprehending novel concepts, even though its sample efficiency is much better. Being already weaker than pretraining, in-context learning is vastly weaker than RLVR in the strength of skills it instills. It’s not going to be relevant for any AGI thresholds without a better training method.
[1] GB300 HBM bandwidth is 8 TB/s per package/chip. With chips at about $3.4 per hour ($12bn per year for a 400K-chip datacenter), $10 buys 10.6e3 chip-seconds, in which 8 TB/s can read 85 GB a million times (once per token for a million tokens).
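A quick sanity check of this arithmetic, using only the figures from the footnote (the rounded $3.4/hour is what reproduces the 10.6e3 seconds):

```python
cost_per_chip_hour = 12e9 / 400_000 / (365 * 24)  # ≈ $3.42 per chip-hour
chip_seconds = 10 / 3.4 * 3600                    # $10 at $3.4/h ≈ 10.6e3 s
bytes_read = chip_seconds * 8e12                  # at 8 TB/s HBM bandwidth
reads_of_85_gb = bytes_read / 85e9                # ≈ 1e6, once per token
print(f"${cost_per_chip_hour:.2f}/h, {chip_seconds:.3g} s, "
      f"{reads_of_85_gb:.3g} reads of 85 GB")
```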