I’m glad you asked. I completely agree that nothing in the current LLM architecture prevents that technically and I expect that it will happen eventually.
The issue in the near term is practicality: training models is, and for the foreseeable future will remain, very expensive. Inference is cheaper, but still costly enough that profit is only possible by serving the model statically (i.e., without changing its weights) to many clients, which amortizes the training cost across all of them.
These clients often rely heavily on models being static, because it makes their behavior predictable enough for a production environment. For example, if you use a model for a chatbot on your company's website, you wouldn't want its personality to change based on what people say to it. We've seen how quickly that can go wrong with Microsoft's Twitter bot Tay.
There's also the question of whether you want your model to internalize new concepts (let's just call it "continual learning") based on everybody's data or on just your data. Using everybody's data is more practical in the sense that you update the one model that everybody uses (something that's in a sense already happening when providers move the training-data cutoff date forward for the latest models), but it's not something users will necessarily be comfortable with. For example, users won't want a model to leak their personal information to others. There are also legal barriers here, of course, especially with proprietary data.
People will probably be more comfortable with a model that updates only on their data, but that isn't practical (yet), in the sense that compute would need to be cheap enough to run an entire, slightly different model for each specific use case. It can already be done to some degree with fine-tuning, but that typically doesn't change the weights of the entire model (that would be prohibitively expensive with current technology), and I don't think this form of fine-tuning can implement continual learning effectively (though I'm happy to be proven wrong here).
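To make concrete what I mean by fine-tuning that doesn't touch most of the weights: the common parameter-efficient approach (e.g. LoRA-style low-rank adapters) freezes the pretrained weight matrices and trains only a small low-rank correction. Here's a toy numpy sketch of the idea; the shapes and names are my own illustration, not what any provider actually runs:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 64, 64, 4

# Frozen "pretrained" weight matrix (hypothetical stand-in for one layer).
W = rng.normal(size=(d_out, d_in))

# Low-rank adapter: only A and B are trained during fine-tuning.
A = rng.normal(size=(rank, d_in)) * 0.01
B = np.zeros((d_out, rank))  # B starts at zero, so the adapter is a no-op initially

def forward(x):
    # Base output plus the low-rank update (W + B@A) @ x, computed cheaply.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
# Before any training, the adapted layer matches the frozen layer exactly.
print(np.allclose(forward(x), W @ x))  # True

# The point: far fewer trainable parameters than full fine-tuning.
full_params = W.size          # 4096 for this layer
lora_params = A.size + B.size # 512 for the adapter
print(lora_params < full_params)  # True
```

The trade-off is exactly what I described above: you get a cheap per-customer delta on top of one shared static model, but the base weights never change, which is a long way from genuine continual learning.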
Do you know about the double descent phenomenon? It seems to contradict the classical bias-variance tradeoff: test error first rises as models approach the interpolation threshold, then falls again as they become even more over-parameterized. Welch Labs did a very good breakdown recently.