Steven Byrnes comments on You can’t imitation-learn how to continual-learn

Steven Byrnes 22 Mar 2026 20:54 UTC
is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm? Even with eg. 3-5 OOMs more compute than GPT-4.5?
I say yes. You left out an important part; here it is in italics: “is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm *churning for millions of steps*?”
Yes, because an awful lot can happen in millions of steps, including things that build on each other in a serial way.
I worry you could’ve made this same argument ten years ago for simulating human expert behavior over 8 hour time horizons — which involves some learning, eg navigating a new code base, checking code on novel unit tests. It’s shallow learning, sure. You don’t have to update your world model that much. But it’s not nothing, and ten years ago I probably would’ve been convinced that a transformer forward pass could never practically approximate it.
I disagree that it should be called “learning” at all. It would be “learning” for a human in real life, but if you imagine a person who has read 2 billion lines of code [that’s the amount of GitHub code in The Pile … actually today’s LLMs probably see way more code than that], which would correspond to reading code 24 hours a day for 100 years, then I believe that such a person could do the METR 8 hour tasks without “learning” anything new whatsoever. You don’t need to “learn” new things to mix-and-match things you already know in novel ways—see my example here of “imagine a pink fuzzy microphone falling out of a helicopter into a football stadium full of bunnies”. And see also: related discussion here.
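As a rough sanity check on the “24 hours a day for 100 years” figure, here is the reading speed it implies. This is a back-of-the-envelope sketch: the only inputs are the 2 billion lines and 100 years mentioned above, and the function name is just illustrative.

```python
# Back-of-the-envelope check: what reading speed does
# "2 billion lines of code in 100 years of nonstop reading" imply?

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, including leap days


def implied_reading_rate(lines: float, years: float) -> float:
    """Return the reading rate in lines per second, reading around the clock."""
    return lines / (years * SECONDS_PER_YEAR)


rate = implied_reading_rate(lines=2_000_000_000, years=100)
seconds_per_line = 1 / rate

print(f"{rate:.2f} lines per second, i.e. {seconds_per_line:.2f} seconds per line")
```

That works out to somewhat under two seconds per line, which seems like a plausible pace for reading code, so the two figures are at least consistent with each other.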
why have a transformer simulate a neural net running some RL algorithm when you could just train the RL agent yourself?
Yup, that’s my main point in this post: I expect that sooner or later somebody will invent real-deal continual learning, and it will look like a bona fide learning algorithm written in PyTorch with gradient descent steps and/or TD learning steps and/or whatever else, as opposed to (so-called) “in-context learning” or RAG etc.
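To make “TD learning steps” and “things that build on each other in a serial way” a bit more concrete, here is a minimal toy sketch. It is deliberately not deep-Q and not PyTorch: it is tabular TD(0) in plain Python on a made-up five-state chain, and the environment, function names, and hyperparameters are all assumptions chosen for illustration, not anything from the post.

```python
# Toy illustration (hypothetical setup): tabular TD(0) on a deterministic chain.
# The agent always moves right through states 0..n-1 and receives reward 1.0
# on the final transition into the terminal state. The true value of every
# state is 1.0 (with gamma = 1).


def td0_episode(values, alpha=0.5, gamma=1.0):
    """Run one left-to-right episode, updating `values` in place with TD(0)."""
    n = len(values)
    for state in range(n):
        next_state = state + 1
        terminal = next_state == n
        reward = 1.0 if terminal else 0.0
        next_value = 0.0 if terminal else values[next_state]
        # TD error: bootstrapped target minus the current estimate.
        td_error = reward + gamma * next_value - values[state]
        # Persistent update: this change is still there in later episodes.
        values[state] += alpha * td_error
    return values


def train(num_episodes, num_states=5):
    """Start from all-zero value estimates and run `num_episodes` episodes."""
    values = [0.0] * num_states
    for _ in range(num_episodes):
        td0_episode(values)
    return values


if __name__ == "__main__":
    for episodes in (1, 4, 5, 200):
        print(episodes, [round(v, 3) for v in train(episodes)])
```

In this toy, the estimate for state 0 stays at exactly zero for the first four episodes and only moves in the fifth, because each state's update bootstraps off an estimate that was itself changed by an earlier update. Even here the learning is a chain of serially dependent, persistent changes to the stored values, and that is with five states and a handful of episodes rather than a neural net and millions of steps.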