I’m not commenting on whether we should think of actual frontier LLMs (not just pretrained base models) as predominantly powered by imitation learning
I’m confused about what your post is saying then. You say “LLMs” throughout, not “base models”. So is your post about base models only, or also about LLMs that have undergone post-training?
And if the latter, why talk as though LLMs have only undergone imitation learning, if they’ve also undergone RL?
Hmm, good point. I guess I was a bit sloppy in jumping around between a couple of different things that I believe, instead of keeping the argument tight and precise.
1. One thing I believe is: “LLMs are predominantly powered by imitation learning”. I didn’t argue for that in the post, but my argument would be basically this comment + one more paper along the same lines + “Most Algorithmic Progress is Data Progress” (+ further discussion in The nature of LLM algorithmic progress §1.4). I don’t feel super-duper strongly and am not defining what “predominantly” means here in any case.
2. Another thing I believe is: “You can’t imitation-learn how to continual-learn”. This is independent of how to think about LLM post-training. I regularly come across people who disagree, on a conceptual level, with this claim, so it seemed worth sharing. Indeed, I now know that there’s a whole little subfield of “algorithm distillation” and “in-context RL”, and my claim (having now read three such papers, see other comments on this page) is that this whole subfield is a dumpster fire where the big idea doesn’t really work but people keep publishing misleading importance-hacked papers anyway.
3. Another thing I believe is: “You can’t meta-learn how to continual-learn”, which is a more general claim because it includes RLVR. This stronger claim is actually what follows from the boldface sentence in the post: “The only practical way to know what happens after millions of steps of some scaled-up continual learning algorithm is to actually do millions of steps of that same scaled-up continual learning algorithm, with actual weights getting actually changed in specifically-designed ways via PyTorch code. And then that’s the scaled-up learning algorithm you’re running. Which means you’re not doing [some different learning algorithm].”
4. Another thing I believe is: “If you put lots of interrelated complex concepts, none of which appear anywhere in the pretraining data, into the context window, then LLMs would crash and burn; rather, the only way for an LLM to fluently use a set of concepts is for all (or at least almost all) of those concepts to be in the weights, not the context window, because they were already used properly a bunch of times in the training data.” I allude to that in the post and elaborate on it in this other comment. This implies that context windows and scratchpads cannot substitute for weight-updates, and that a “country of geniuses in a datacenter” (who would presumably be inventing entirely new fields of science etc.) cannot consist of LLMs with very long context windows in a sealed box for the equivalent of 100 years with no human intervention.
5. Another thing I believe is: there’s no way to close the loop such that a closed system of LLMs can come up with new useful concepts and get those concepts into their own weights, e.g. open-ended self-distillation setups won’t work on LLMs. But that’s definitely off-topic for this OP. Self-distillation setups would be a bona fide continual learning algorithm, by the standards of this OP, e.g. there’s PyTorch code for continual weight updates. Whether that setup would actually work in practice, and how far it would go, are different questions.
So the main points of this OP are basically the second, third, and fourth of these, which are all pretty closely related, plus the stuff about how to think about continual learning in general.
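To make the weights-versus-context distinction concrete, here is a toy sketch in plain Python (not PyTorch, and not anything from the post). Everything in it is a hypothetical illustration: `ToyModel`, `weight_update_step`, and `context_only_step` are made-up names, and a one-parameter linear model stands in for "the weights". It only shows which piece of state each kind of step mutates; it makes no claim about what a real LLM can or cannot do with its context.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in: y_hat = w * x, where `w` plays the role of the weights."""
    w: float = 0.0
    context: list = field(default_factory=list)

    def predict(self, x: float) -> float:
        # The toy prediction depends only on `w`. A real LLM's forward pass also
        # conditions on the context; that is deliberately left out here.
        return self.w * x


def weight_update_step(model: ToyModel, x: float, y: float, lr: float) -> None:
    """One SGD step on squared error. This is the step that changes `model.w`."""
    grad = 2.0 * (model.predict(x) - y) * x
    model.w -= lr * grad


def context_only_step(model: ToyModel, x: float, y: float) -> None:
    """Record the example in the context. `model.w` is left untouched."""
    model.context.append((x, y))


# 150 samples drawn from y = 3x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)] * 50

learner = ToyModel()
for x, y in data:
    weight_update_step(learner, x, y, lr=0.05)

frozen = ToyModel()
for x, y in data:
    context_only_step(frozen, x, y)

print("weight-updating model: w =", learner.w)
print("context-only model:    w =", frozen.w, "| context length =", len(frozen.context))
```

In the first loop the parameter itself moves toward 3; in the second, the parameter never changes and only the context grows. Whether a frozen model can make good use of that growing context is exactly the question under dispute above, and the toy does not settle it.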
Why not? The success of this architecture published two days ago (and its seeming resemblance to how human mathematical progress happens) updates me towards thinking such a thing would work, even though it probably doesn’t demonstrate “coming up with new useful concepts”.
I’ll bow out of that argument. Time will tell!
Well, one important-to-me disanalogy is that they used the Lean proof-assistant as ground truth for an LLM’s purported proof being valid or not. Whereas human mathematical progress obviously does not require proof-assistants—humans were doing math long before proof-assistants existed. (More on this in §1 of my post “Sharp Left Turn” discourse: An opinionated review.)
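For concreteness, here is roughly what "Lean as ground truth" amounts to, as a minimal sketch assuming Lean 4 syntax (the theorem name `add_zero_toy` is made up for illustration, and none of this is taken from the linked work). The checker either accepts a proof or rejects it, so the LLM's output gets a hard pass/fail signal that doesn't depend on anyone's judgment:

```lean
-- Accepted: both sides reduce to the same numeral, so `rfl` type-checks.
example : 2 + 2 = 4 := rfl

-- Accepted: `n + 0` unfolds to `n` by the definition of addition on `Nat`.
theorem add_zero_toy (n : Nat) : n + 0 = n := rfl

-- Rejected if uncommented: `rfl` cannot prove a false equation.
-- example : 2 + 2 = 5 := rfl
```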