[Question] Why no major LLMs with memory?

Kaj_Sotala 28 Mar 2023 16:34 UTC

42 points

AI, Language Models (LLMs), Machine Learning (ML)

One thing that I’m slightly puzzled by is that an obvious improvement to LLMs would be adding some kind of long-term memory that would allow them to retain more information than fits in their context window. Naively, I would imagine that even just throwing some recurrent neural net layers in there would be better than nothing?

But while I’ve seen LLM papers that talk about how they’re multimodal or smarter than before, I don’t recall seeing any widely publicized model that extends memory beyond the immediate context window, and that confuses me.

What links here?

Memory Persistence within Conversation Threads with Multimodal LLMS by sjay8 (30 Mar 2025 7:16 UTC; 4 points)

Carl Feynman 28 Mar 2023 17:14 UTC
29 points
9
Models with long-term memory are very hard to train. Instead of being able to compute a weight update after seeing a single input, you have to run in a long loop of “put thing in memory, take thing out, compute with it, etc.” before you can compute a weight update. It’s not a priori impossible, but nobody’s managed to get it to work. Evolution has figured out how to do it because it’s willing to waste an entire lifetime to get a single noisy update.
People have been working on this for years. It’s remarkable (in retrospect, to me) that we’ve gotten as far as we have without long term memory.
- jacopo 28 Mar 2023 19:42 UTC
  4 points
  0
  Parent
  Isn’t that the point of the original transformer paper? I have not actually read it, just going by summaries read here and there.
  
  If I remember correctly, RNNs should be especially difficult to train in parallel.
  - Carl Feynman 28 Mar 2023 23:41 UTC
    6 points
    0
    Parent
    Transformers take O(n^2) computation for a context window of size n, because they effectively feed everything inside the context window to every layer. The context window provides the benefits of a small memory, but it doesn’t scale. A transformer has no way of remembering things from before the context window, so it’s like a human with a busted hippocampus (Korsakoff’s syndrome) who can’t make new memories.
- Noosphere89 28 Mar 2023 17:29 UTC
  3 points
  2
  Parent
  I suspect much of the reason we haven’t needed much long-term memory is that we can increase the context window pretty cheaply, so long-term memory gets deprioritized.
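To put rough numbers on the O(n^2) point made earlier in this thread, here is a small sketch that counts the query-key pairs a single attention head scores for a window of n tokens. It is an illustrative count only, not a profile of any real model: the function name `attention_pairs` is made up for this example, and it ignores layers, heads, and the hidden dimension.

```python
def attention_pairs(n: int, causal: bool = True) -> int:
    """Count query-key pairs scored by one attention head over n tokens.

    With a causal mask, token i attends to tokens 0..i, giving n*(n+1)/2 pairs.
    Without a mask, every token attends to every token, giving n*n pairs.
    Either way the count grows quadratically in n.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n + 1) // 2 if causal else n * n


if __name__ == "__main__":
    # Doubling the window roughly quadruples the number of scored pairs.
    for n in (1024, 2048, 4096):
        print(n, attention_pairs(n))
```

Whether that growth counts as "cheap" depends on how large n needs to be; the sketch only shows the shape of the curve.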
Lone Pine 28 Mar 2023 18:11 UTC
11 points
0
There is an architecture called RWKV which claims to have an ‘infinite’ context window (since it is similar to an RNN). It claims to be competitive with GPT-3. I have no idea whether this is worth taking seriously or not.
- abhayesian 28 Mar 2023 22:15 UTC
  9 points
  0
  Parent
  I don’t think it’s fair for them to claim that the model has an infinite context length. It appears that they can train the model as a transformer, but can turn the model into an RNN at inference time. While the RNN doesn’t have a context length limit as the transformer does, I doubt it will perform well on contexts longer than it has seen during training. There may also be limits to how much information can be stored in the hidden state, such that the model has a shorter effective context length than current SOTA LLMs.
- bvbvbvbvbvbvbvbvbvbvbv 29 Mar 2023 8:00 UTC
  3 points
  0
  Parent
  Two links for learning more about RWKV:
  
  https://johanwind.github.io/2023/03/23/rwkv_overview.html
  
  https://johanwind.github.io/2023/03/23/rwkv_details.html
Ustice 28 Mar 2023 20:19 UTC
7 points
1
Given that LLMs can use tools, it sounds like a traditional database could be used for this. The retrieved data would still have to fit inside the context window, along with the generated continuation prompt, but that might work for a lot of cases.
- hold_my_fish 29 Mar 2023 3:54 UTC
  4 points
  0
  Parent
  I could also imagine this working without explicit tool use. There are already systems for querying corpuses (using embeddings to query vector databases, from what I’ve seen). Perhaps the corpus could be past chat transcripts, chunked.
  I suspect the trickier part would be making this useful enough to justify the additional computation.
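The retrieval idea in this thread (chunk past chat transcripts, embed them, and pull back the closest chunk for a new query) can be sketched roughly as follows. This is a toy under loud assumptions: `embed` is a bag-of-words stand-in for a real learned embedding model, the in-memory list stands in for a vector database, and all function names and the sample transcript are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(transcript: List[str], size: int = 2) -> List[str]:
    """Group consecutive transcript lines into chunks of `size` lines."""
    return [" ".join(transcript[i:i + size]) for i in range(0, len(transcript), size)]


def retrieve(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical past conversation, used only for illustration.
transcript = [
    "User: My cat is named Miso.",
    "Assistant: Nice to meet Miso!",
    "User: I am planning a trip to Lisbon in May.",
    "Assistant: Lisbon is lovely in spring.",
]
chunks = chunk(transcript)
top = retrieve("What is the name of my cat?", chunks)

if __name__ == "__main__":
    # The retrieved chunk would be prepended to the prompt for the next turn.
    print(top[0])
```

Every new query pays for an embedding and a similarity search on top of the model call, which is the extra computation the comment above is worried about.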
abhayesian 28 Mar 2023 22:27 UTC
4 points
0
One thing that comes to mind is DeepMind’s Adaptive Agents team using Transformer-XL, which can attend to data outside the current context window. I think there was speculation that GPT-4 may also be a Transformer-XL, but I’m not sure how to verify that.
- Oliver Daniels 29 Mar 2023 15:53 UTC
  1 point
  0
  Parent
  Briefly read a ChatGPT description of Transformer-XL. Is this essentially long-term memory? Are there computations an LSTM could do that a Transformer-XL couldn’t?
  - abhayesian 29 Mar 2023 19:25 UTC
    2 points
    0
    Parent
    There is still technically a limit to how far back a Transformer-XL can see since each layer can only attend to previous keys/values computed by that layer. As a result, the receptive field of layer L can only be as wide as the last L context windows. I guess this means that there might be some things that LSTMs can do that Transformer-XL can’t, but this can be fixed with a couple of minor modifications to Transformer-XL. For example, this paper fixes the problem by allowing layers to attend to the outputs of later layers from previous context windows, which should make the receptive field (at least theoretically) infinitely long, meaning it should probably be able to do everything an LSTM can.
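As a back-of-the-envelope illustration of the receptive-field limit described in this thread, the sketch below takes the stated rule at face value: the top of an L-layer model can reach back at most L context windows. The function name and the example sizes (24 layers, 512-token windows) are hypothetical and not taken from any particular model.

```python
def xl_receptive_field(num_layers: int, window: int) -> int:
    """Upper bound on how many tokens back the top layer can see.

    Assumes the rule from the comment above: each layer adds at most one
    context window of look-back, so L layers reach at most L windows.
    """
    if num_layers < 1 or window < 1:
        raise ValueError("num_layers and window must be positive")
    return num_layers * window


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the arithmetic.
    print(xl_receptive_field(num_layers=24, window=512))
```

The bound is finite for any fixed depth, which is the contrast with an LSTM whose hidden state can, at least in principle, carry information from arbitrarily far back.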
bvbvbvbvbvbvbvbvbvbvbv 29 Mar 2023 8:01 UTC
3 points
0
On mobile, but FYI: LangChain implements some kind of memory.

Also, this other post might interest you. It’s about asking GPT to decide when to call a memory module to store data: https://www.lesswrong.com/posts/bfsDSY3aakhDzS9DZ/instantiating-an-agent-with-gpt-4-and-text-davinci-003
Ustice 28 Mar 2023 20:26 UTC
3 points
0
Given that we know that LLMs can use tools, can traditional databases be used for long-term memory?
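One way this could look, as a minimal sketch using Python's built-in `sqlite3` module: notes are written to a table, and a keyword lookup pulls back only as many as fit a budget before they are pasted into the prompt. The table layout, the function names, and the use of a character budget as a stand-in for a token budget are all assumptions made for illustration.

```python
import sqlite3
from typing import List


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a notes table to act as long-term memory."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, text TEXT NOT NULL)"
    )
    return conn


def remember(conn: sqlite3.Connection, text: str) -> None:
    """Store one note."""
    conn.execute("INSERT INTO notes (text) VALUES (?)", (text,))
    conn.commit()


def recall(conn: sqlite3.Connection, keyword: str, char_budget: int = 200) -> List[str]:
    """Return the newest notes containing `keyword` that fit within the budget."""
    rows = conn.execute(
        "SELECT text FROM notes WHERE text LIKE ? ORDER BY id DESC",
        (f"%{keyword}%",),
    )
    picked, used = [], 0
    for (text,) in rows:
        if used + len(text) > char_budget:
            break  # stop once the next note would overflow the budget
        picked.append(text)
        used += len(text)
    return picked


conn = open_memory()
remember(conn, "The user's favourite language is Haskell.")
remember(conn, "The user lives in Helsinki.")

# Recalled notes are placed ahead of the new user turn in the prompt.
prompt = "\n".join(recall(conn, "language")) + "\nUser: which language should the example use?"

if __name__ == "__main__":
    print(prompt)
```

The budget check reflects the constraint that whatever is recalled still has to fit inside the context window alongside the rest of the prompt.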
Bartlomiej Lewandowski 29 Mar 2023 13:28 UTC
2 points
1
I think there has been a lot of research in this space in the past. The first thing that popped into my mind was https://huggingface.co/docs/transformers/model_doc/rag
Currently, there are some approaches that use LangChain to persist the history of a conversation into an embeddings database and retrieve the relevant parts when performing a similar query or task.