I would quibble with the framing of this piece. As you note, the problem is not imitation learning itself (in principle, it can work!) but the limits of the standard transformer architecture.
A specific limit you could point to, to make this argument stronger, is that a depth-D transformer can implement at most O(D) steps of gradient descent in a single forward pass, no matter how long its context window is. I think this is underappreciated. By design, transformers cannot perform sequential computation along the input dimension, only along the depth dimension. This is their main tradeoff versus RNNs: it’s what allows transformer training to be parallelized, but it also means transformers can only implement algorithms that fit within this parallelization structure (formally, under standard assumptions such as log-precision arithmetic, a transformer forward pass is in the complexity class TC0, the class of problems solvable by constant-depth threshold circuits).
But learning in general is an inherently sequential process: you evaluate your current hypothesis, update it (take a gradient step), re-evaluate, update again, and so on, with each step depending on the result of the previous one. So a transformer forward pass cannot even in principle emulate long-running gradient descent (or RL, etc.) across its context window. It just doesn’t have the sequential depth. This holds whether it’s trained with imitation learning, RL, or God Himself setting the weights; it’s an inherent limit of the architecture.
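To make the sequential-depth point concrete, here is a minimal Python sketch. It is only a toy illustration, not a model of a transformer: the quadratic loss, the learning rate, and the names `gradient_descent` and `grad_quadratic` are my own assumptions. The thing to notice is that each iterate needs the previous one, so a budget of a few sequential steps (standing in for depth D) gets you only that far, regardless of how much data is available.

```python
def gradient_descent(grad, w0, lr, num_steps):
    """Run num_steps of gradient descent; each step needs the previous iterate."""
    w = w0
    for _ in range(num_steps):
        # This update cannot start until the previous value of w is known,
        # so the loop is a chain of num_steps dependent computations.
        w = w - lr * grad(w)
    return w


def grad_quadratic(w):
    """Gradient of the toy loss f(w) = (w - 3)^2 (an assumed example loss)."""
    return 2.0 * (w - 3.0)


# A small sequential budget (think: depth D = 4) stops well short of the optimum...
w_shallow = gradient_descent(grad_quadratic, w0=0.0, lr=0.1, num_steps=4)

# ...while a long sequential chain converges to the minimizer w = 3.
w_deep = gradient_descent(grad_quadratic, w0=0.0, lr=0.1, num_steps=200)

print(w_shallow, w_deep)
```

With this step size the distance to the optimum shrinks by a factor of 0.8 per step, so the only way to get closer is to take more dependent steps.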
On the other hand, a transformer augmented with the right kind of recurrent state could in principle implement long-term learning, although how to train that is an open question. It’s not obvious how far you can get doing imitation learning from short contexts, but it’s also not obvious that this can’t work: standard RL algorithms generally apply the same update rule over and over, and given enough data you might hope to grok such an update rule (which could then generalize to long contexts). So again, I think the stronger claim is about the transformer architecture, not imitation learning per se.
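As a rough sketch of what "the same update rule over and over" plus recurrent state buys you, here is a toy online learner in Python. Everything in it is an assumption for illustration: the update is a plain least-mean-squares step on a noise-free 1-D regression problem, and `lms_update`, `run_stream`, and `make_stream` are hypothetical names, not anything from the post. The rule itself is tiny and could be picked up from short streams, but because the state `w` is carried forward, it keeps improving on arbitrarily long ones.

```python
import random


def lms_update(w, x, y, lr=0.05):
    """One application of the update rule: nudge w to reduce (w*x - y)^2."""
    return w - lr * 2.0 * (w * x - y) * x


def run_stream(stream, w0=0.0):
    """Apply the same rule to every (x, y) pair, carrying w as recurrent state."""
    w = w0
    for x, y in stream:
        w = lms_update(w, x, y)
    return w


def make_stream(n, true_w, rng):
    """Noise-free samples from y = true_w * x, with x uniform in [-1, 1]."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x) for x in xs]


rng = random.Random(0)
TRUE_W = 2.0

w_short = run_stream(make_stream(5, TRUE_W, rng))     # short context
w_long = run_stream(make_stream(5000, TRUE_W, rng))   # same rule, long context

print(w_short, w_long)
```

The number of sequential updates here grows with the stream length, which is exactly what a fixed-depth forward pass cannot provide but a recurrent state can.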