I think I’m a little confused about the hypothesis space part. I agree it sounds implausible to run multiple learning algorithms in parallel within a transformer forward pass to find the best one, and the search space is really large.
But if we just ask about the hypothesis space for a moment: is it really practically impossible for a transformer forward pass to simulate a deep-Q-style learning algorithm, even with, e.g., 3–5 OOMs more compute than GPT-4.5?
I worry you could’ve made this same argument ten years ago about simulating human expert behavior over 8-hour time horizons — which involves some learning, e.g. navigating a new code base, or checking code against novel unit tests. It’s shallow learning, sure; you don’t have to update your world model much. But it’s not nothing, and ten years ago I probably would’ve been convinced that a transformer forward pass could never practically approximate it. Why is the deep-Q-style learning algorithm so much harder to simulate?
It feels like there’s some theoretical claim about complexity underlying your position: something like {whatever quasi-learning algorithm + heuristics an LLM uses to simulate 8 hours of SWE} is exponentially simpler than {any true continual learning algorithm}. (That’s why you’d need the hypercomputer, if I’m reading you right?) Could you spell that out more?
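To make my question concrete, here’s a toy sketch of what I mean by "simulate a deep-Q-style learning algorithm in a forward pass" (my own illustration, not a claim about how real transformers work): unrolling tabular Q-learning updates over an in-context batch of transitions as one fixed computation, with no weight updates.

```python
import numpy as np

def q_update_unrolled(transitions, n_states, n_actions, alpha=0.5, gamma=0.9):
    """Unroll tabular Q-learning over an in-context list of transitions.

    No parameters are trained here: the whole thing is a fixed function of
    the context, analogous to what a forward pass would have to emulate.
    """
    Q = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        # Standard Q-learning update; each step depends on the previous Q,
        # so the updates are inherently sequential.
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy 2-state MDP: action 1 in state 0 yields reward 1.
context = [(0, 1, 1.0, 1), (0, 0, 0.0, 0), (0, 1, 1.0, 1)]
Q = q_update_unrolled(context, n_states=2, n_actions=2)
```

The toy also hints at where a complexity claim could bite: each update depends sequentially on the last, so a fixed-depth network emulating N updates plausibly needs depth or compute that grows with N. That’s one way the {quasi-learning} vs. {true continual learning} gap could be cashed out, but I’d like to hear your version.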
Even if you can simulate a continual learning algorithm within a transformer or other imitation learner, I agree that it feels like unnecessary complexity: why have a transformer simulate a neural net running some RL algorithm when you could just train the RL agent yourself?