We Need To Know About Continual Learning

(Original title: The Continual Learning Overhang, but too many Overhang posts)

TL;DR Continual learning could lead to emergent causal understanding or agency and we should probably study it now before it’s too late.

Current LLMs are very impressive but suffer from a few key missing pieces (Sarah Constantin):

  • They inherently lack agency. They can be wrapped in a loop of control code that makes them more agentic, but it feels like this won’t really take off.

  • They are missing causal modeling to some extent. It’s just not how they are trained. They may pick up a sense of causal structure from fiction, but it’s not quite the same. They never train by interacting with anything and have no inherent sense of “making an intervention”.

Sarah argues that the current approaches are very unlikely to spontaneously develop these characteristics, and that doing so would require a ground-level rethinking of how AI is done. I am not so convinced; I don’t think we’ve seen the full potential of the path we are on.


I think that we have yet to explore (publicly?) the potential of “switching on” backpropagation/training while in inference mode. Most models have a clean separation between “train” and “inference”: inference runs the model to generate text token by token, but the model is no longer learning.
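To make the idea concrete, here is a minimal sketch of what “switching on” training at inference time could look like. This is only an illustration of the mechanics, not a recipe: the model (“gpt2” as a stand-in), the learning rate, the rolling 512-token window, and the choice to take a gradient step on the model’s own generations are all placeholder assumptions.

```python
# Minimal sketch: interleave generation with gradient updates, so the model
# keeps learning from the text it has just produced/observed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM with a language-modeling head
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.SGD(model.parameters(), lr=1e-5)  # a guess; too high invites forgetting

context = tok("The robot opened the door and", return_tensors="pt").input_ids

for step in range(10):
    # Inference: generate a short continuation from the current context.
    model.eval()
    with torch.no_grad():
        out = model.generate(context, max_new_tokens=20, do_sample=True)

    # "Switch on" training: one gradient step on the sequence that just happened,
    # i.e. treat the interaction itself as new training data.
    model.train()
    loss = model(out, labels=out).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

    context = out[:, -512:]  # keep only a short rolling window as the prompt
```

Whether updates like this actually accumulate into useful long-term memory, rather than just degrading the model, is exactly the open question.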

Why I am skeptical of agency in current models / AutoGPT

AutoGPT is very interesting, but seems to be plagued by trouble actually completing tasks. It may improve with better code & prompts, but I still suspect it will miss the mark.

The context window is a very strong handicap in practice. To be a proper agent, one must have a long-term goal and a good representation of the current state of the world. Without updating the weights of the model, all of this must fit inside the context window. Why this is suboptimal:

  • The context window can in theory hold several thousand tokens (32K tokens for the largest GPT-4 variant!), which is not bad, but the contents are in word form and may not be efficiently encodable or decodable.

  • Finding ways to track state between calls to GPT-4 is not itself something the model was trained for, so (a) it relies on human trial and error, and (b) on the model side, the inputs are always expected in natural language, so no specialized, efficient encoding can emerge (other than that weird emoji soup). A sketch of what this plumbing looks like follows this list.

  • In any case, it seems very unlikely that even a human could be an effective agent at anything if they could only remember things in 5-minute chunks and then forgot everything after that.
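For contrast, here is roughly what the plumbing looks like today with frozen weights: every piece of long-term state has to be re-serialized into the prompt on every call, and pulled back out of prose afterwards. The `chat` and `parse_world_state` functions below are hypothetical stubs, not any particular framework’s API.

```python
# Sketch of the status quo: with frozen weights, the only "memory" is whatever
# we re-serialize into the prompt each turn, via hand-rolled, untrained plumbing.

def chat(prompt: str) -> str:
    """Hypothetical stand-in for a call to a frozen LLM API."""
    return "Emailed one venue. World state: one venue contacted, awaiting reply."

def parse_world_state(reply: str) -> str:
    """Hand-rolled parsing to pull state back out of natural-language prose."""
    return reply.split("World state:")[-1].strip()

state = {
    "goal": "Book a venue for the workshop",
    "world": "No venue contacted yet",
    "scratchpad": [],
}

def build_prompt(state: dict) -> str:
    # Everything the agent "knows" must be spelled out in natural language
    # and must fit inside the context window, on every single call.
    return (
        f"Goal: {state['goal']}\n"
        f"Current world state: {state['world']}\n"
        f"Notes so far: {'; '.join(state['scratchpad'][-10:])}\n"
        "What is the next action? Also restate the updated world state."
    )

for turn in range(20):
    reply = chat(build_prompt(state))          # nothing persists inside the model
    state["scratchpad"].append(reply)          # persistence lives entirely outside it
    state["world"] = parse_world_state(reply)
```

None of this glue code is trained end-to-end with the model, which is the point of the second bullet above.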

Relationship of Continual Learning to Context Window Size

The context window is currently the only way a model can preserve knowledge from one call to the next. With continual learning, at its simplest, information can instead be transferred via the weights of the model.

Of course, in practice, does updating the weights really allow for efficient data transfer? That remains to be seen. However, we have some intuition for why it might:

  • The biggest LLMs are very good at memorization. How many times those data points were seen during training, and what the learning rate was at that point, are unknowns for now.

  • There were papers that came out recently arguing that in-context learning bears a striking similarity to backpropagation. A great LW post by Blaine analyzes them and sheds some light on why that analogy is not perfect. Still, it makes one wonder: if in-context learning behaves similarly to backpropagation, is the reverse true? Does backpropagation behave similarly to in-context learning?
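A toy way to see the flavor of the analogy (this is the standard linear-regression construction from that line of work, not a claim about what large transformers actually do): for a linear model initialized at zero, the prediction after a single gradient step is exactly an unnormalized linear-attention readout over the in-context examples.

```python
# One gradient-descent step from w = 0 on the loss 0.5 * ||Xw - y||^2 gives
# w1 = lr * X^T y, so the prediction q @ w1 = lr * sum_i (q . x_i) * y_i,
# which is an (unnormalized) linear-attention readout with keys x_i, values y_i.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))   # in-context example inputs
y = rng.normal(size=(8,))     # in-context example targets
q = rng.normal(size=(4,))     # query point
lr = 0.1

# Explicit gradient descent: a single step from zero initialization.
w = np.zeros(4)
grad = X.T @ (X @ w - y)
pred_gd = q @ (w - lr * grad)

# "Linear attention": similarity of the query to each key, weighted sum of values.
pred_attn = lr * np.sum((X @ q) * y)

assert np.allclose(pred_gd, pred_attn)  # identical up to floating-point error
```

Real transformers add softmax, learned projections, and many layers, which is part of why the analogy is not exact.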

Why not just increase the context window size?


The main reason is that we need to know how much capability is being left on the table by our current setup.

Longer context windows are another clear direction to look at, but there is already a ton of research going on there (Hyena, Survey of methods). My intuitions for why a continual learning approach would scale better:

  • Approximations that limit the quadratic cost of attention seem to hurt performance in practice.

  • Paying the quadratic cost of attention over short-term, newly arriving input is fine, as long as it is combined with a long-term memory that lives outside the window (e.g. in the weights).


That said, fast attention would probably be invaluable in continuous time-domain cases, because even a “short time window” in a robotics setting could consist of nearly a thousand frames (30 fps × 30 seconds = 900), times however many tokens it takes to encode a frame.

Why it might not work


It may be the case that this is a very hard problem. In the ML models I have trained (up to 500M parameters), the learning rate is typically annealed and training converges, so examples seen towards the end are weighted much less. (Typically the examples seen at the end are repeats, assuming you train for more than one epoch.) Continually training a model (as opposed to fine-tuning it on a separate task) is finicky: how high should the learning rate be set? Would experience replay be needed? Catastrophic forgetting also seems to be a real issue.


These issues plague smaller models to a large extent, but we do not know whether larger models might be more robust to them. Large models seem capable of learning abilities that smaller ones cannot (for example, learning from natural language feedback).
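For concreteness, the kind of naive setup where these questions would need answering looks roughly like the sketch below. The constant learning rate, the replay ratio, and the buffer size are guesses, and the model is assumed to be an HF-style causal LM that returns a loss when given labels; none of this is a known-good recipe for avoiding catastrophic forgetting.

```python
# Naive continual training: a small constant learning rate plus experience replay.
import random
import torch

def continual_update(model, optimizer, new_batch, replay_buffer,
                     buffer_size=10_000, replay_samples=4):
    # Mix the fresh data with a few old examples to push back on forgetting.
    batches = [new_batch] + random.sample(
        replay_buffer, k=min(replay_samples, len(replay_buffer))
    )
    optimizer.zero_grad()
    for input_ids in batches:
        # Assumes an HF-style causal LM whose forward pass returns .loss
        loss = model(input_ids, labels=input_ids).loss
        (loss / len(batches)).backward()
    optimizer.step()

    # Keep a bounded, roughly uniform sample of everything seen so far.
    replay_buffer.append(new_batch)
    if len(replay_buffer) > buffer_size:
        replay_buffer.pop(random.randrange(len(replay_buffer)))

# Usage sketch: a small constant LR instead of an annealed schedule.
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
# replay_buffer = []
# for new_batch in stream_of_interactions:
#     continual_update(model, optimizer, new_batch, replay_buffer)
```

How high the learning rate can go, and how much replay is enough, is exactly what I would want measured on a large model rather than a small one.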

Why should we do it?

Without continual learning, the model never learns to make an intervention and see what happens. The model simulates but can never internally develop & store a sense of self. Maybe it will be harder than just “turning on backprop”, but what if it isn’t much more complex than that? Do we want to know? I think we want to know.

We are still at the point where GPT-4 seems to have enough blind spots that we can still “switch it off”. If this line of intervention can strengthen the capabilities of the model, we should find out now rather than later, when we have an even more powerful model. Suppose we had GPT-7 or something, capable of superhuman performance on a variety of tasks but still reasonably docile and not very capable of long-term planning; I would not want to switch that one “on”.

Currently OpenAI is the only company with a model this powerful whose output demonstrates non-shallow understanding of the world. Ideally, they should experiment along these lines in collaboration with ARC and publish their findings. I would also be curious whether anyone can get LLaMA to do this; even a negative result would be an interesting finding.

Or, Can This Lead to a Concrete Policy Proposal?

As Sarah points out, doing interventions in the real world (with robotics) would be extremely expensive, so we can probably stick to chatbots for now. One question that comes up, though: does the model need millions of interactions/interventions to really learn this, or is it more sample efficient?


We do have one source of millions of chat interactions to learn from: all the ChatGPT conversation histories! If continual learning turns out to be quite powerful, would we eventually want to discourage companies from training models larger than GPT-4 on their own raw chat logs?

One could even argue that we already don’t want this, due to privacy concerns and potential data leakage between chat sessions. (Despite my calling for research into this, doing it on massive conversation data seems like a bad idea.)