On what kinds of tasks do you expect online continual learning to outcompete LLM agents equipped with a database and great context engineering?
I’m looking for settings in which to study model drift as models pursue goals where the current LLM agent paradigm (static weights with a knowledge cutoff) seems to limit them, i.e., where they perform badly.
For a specific guess, I’d say it’s tasks that are fairly simple by human standards but idiosyncratic in their details to a particular person or business. Those users aren’t going to do great context engineering, but they will put in a little time telling the agent what it’s doing wrong and how to do it better, the way they’d train a human assistant. That specificity is the big edge limited continual learning has over context engineering.
Before long, I expect even limited continual learning to outperform context engineering in pretty much every area, because the model is doing the work rather than humans doing meticulous engineering for each task.
But we don’t yet have even limited continual learning in deployment, and I remain a little confused about why. I know working versions are in development, but there are hang-ups. Interference is one, but I wonder what else is preventing “just have the model think a bunch about what it’s learned about this task, produce some example context-response pairs, and finetune on those” from working.
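For concreteness, here’s a minimal sketch of that recipe. Everything in it is an assumption on my part: `generate`, `finetune`, and the prompt wording are placeholders for whatever inference and training stack you’d actually use, not any real library’s API.

```python
# A minimal sketch of "reflect, distill into synthetic pairs, finetune".
# `generate` and `finetune` are placeholders, not real library calls.
import json

def generate(prompt: str) -> str:
    """Placeholder: call your model and return its completion."""
    raise NotImplementedError

def finetune(pairs: list[dict]) -> None:
    """Placeholder: supervised finetuning on (context, response) pairs."""
    raise NotImplementedError

def consolidate(task_log: list[str], n_pairs: int = 16) -> list[dict]:
    """Have the model reflect on its recent task history, then distill
    the lessons into synthetic training examples."""
    reflection = generate(
        "Here is a log of your recent attempts at this task:\n"
        + "\n".join(task_log)
        + "\nSummarize what you have learned: recurring mistakes, "
          "user preferences, and corrections you were given."
    )
    pairs_json = generate(
        f"Based on these lessons:\n{reflection}\n"
        f"Write {n_pairs} example (context, response) pairs as a JSON list "
        'of objects with "context" and "response" keys, demonstrating '
        "the corrected behavior."
    )
    return json.loads(pairs_json)

def continual_learning_step(task_log: list[str]) -> None:
    pairs = consolidate(task_log)
    finetune(pairs)  # the step where interference can creep in
```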
Cart-pole balancing seems like a good toy case.
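If it helps make that concrete, here’s a minimal cart-pole episode loop, assuming Gymnasium’s CartPole-v1 environment (`pip install gymnasium`); the random policy is a stand-in for whatever continually-learning agent you’d actually study.

```python
# A minimal cart-pole episode loop using Gymnasium's CartPole-v1.
# The random policy is a placeholder for the agent under study.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()  # replace with your agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated

print(f"episode return: {episode_return}")
env.close()
```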
I outlined my take in