Note that this is different from the (also very interesting) question of what LLMs, or the transformer architecture, are capable of accomplishing in a single forward pass. Here we’re talking about what they can do under typical auto-regressive conditions like chat.
I would appreciate if the community here could point me to research that agrees or disagrees with my claim and conclusions, below.
Claim: one pass through a transformer (of a given size) can only do a finite number of reasoning steps.
Therefore: If we want an agent that can plan over an unbounded number of steps (e.g. one that does tree-search), it will need some component that can do an arbitrary number of iterative or recursive steps.
Sub-claim: The above claim does not conflict with the Universal Approximation Theorem.
[Epistemic note: I’m going past my expertise here, just giving my own current understanding, I encourage people with more expertise on this (possibly including you) to correct me]
Therefore: If we want an agent that can plan over an unbounded number of steps (e.g. one that does tree-search)
As a side note, I’m not sure how tree search comes into play; in what way does tree search require unbounded steps that doesn’t apply equally to linear search?
If we want an agent that can plan over an unbounded number of steps...it will need some component that can do an arbitrary number of iterative or recursive steps.
No finite agent, recursive or otherwise, can plan over an unbounded number of steps in finite time, so it’s not immediately clear to me how iteration/recursion is fundamentally different in practice. I think the better comparison is the power of a transformer under auto-regressive conditions (note that intermediate steps don’t need to be shown to a user as output). The first paper linked above shows that given polynomial intermediate steps, transformers can handle exactly the class of problems that can be solved in polynomial time, aka P. So under those conditions they’re pretty strong and general (which certainly doesn’t mean that LLMs trained on next-token prediction are that strong and general).
One useful way to think about it, for me, is that auto-regression is a form of recursion, albeit one that’s bounded in current architecture by the context length.
Thanks for the references; I’ll need some time to review them. In the meanwhile, I’ll make some quick responses.
As a side note, I’m not sure how tree search comes into play; in what way does tree search require unbounded steps that doesn’t apply equally to linear search?
I intended tree search as just one example, since minimax tree search is a common example for game-based RL research.
No finite agent, recursive or otherwise, can plan over an unbounded number of steps in finite time...
In general, I agree. Though there are notable exceptions for cases such as (not mutually exclusive):
a closed form solution is found (for example, where a time-based simulation can calculate some quantity at an any arbitrary time step using the same amount of computation)
approximate solutions using a fixed number of computation steps are viable
a greedy algorithm can select the immediate next action that is equivalent to following a longer-term planning algorithm
… so it’s not immediately clear to me how iteration/recursion is fundamentally different in practice.
Yes, like I said above, I agree in general and see your point.
As I’m confident we both know, some algorithms can be written more compactly when recursion/iteration are available. I don’t know how much computation theory touches on this; i.e. what classes of problems this applies to and why. I would make an intuitive guess that it is conceptually related to my point earlier about closed-form solutions.
I would appreciate if the community here could point me to research that agrees or disagrees with my claim and conclusions, below.
Claim: one pass through a transformer (of a given size) can only do a finite number of reasoning steps.
Therefore: If we want an agent that can plan over an unbounded number of steps (e.g. one that does tree-search), it will need some component that can do an arbitrary number of iterative or recursive steps.
Sub-claim: The above claim does not conflict with the Universal Approximation Theorem.
[Epistemic note: I’m going past my expertise here, just giving my own current understanding, I encourage people with more expertise on this (possibly including you) to correct me]
Merrill and Sabharwal have done some very useful-seeming work on how strong transformers can be, both under auto-regressive conditions (‘The Expressive Power of Transformers with Chain of Thought’) and in a single forward pass with assumptions about precision (eg ‘The Parallelism Tradeoff: Limitations of Log-Precision Transformers’), although I haven’t gone through it in detail. Certainly it’s unambiguously true that there are limitations to what can be done in a single forward pass.
As a side note, I’m not sure how tree search comes into play; in what way does tree search require unbounded steps that doesn’t apply equally to linear search?
No finite agent, recursive or otherwise, can plan over an unbounded number of steps in finite time, so it’s not immediately clear to me how iteration/recursion is fundamentally different in practice. I think the better comparison is the power of a transformer under auto-regressive conditions (note that intermediate steps don’t need to be shown to a user as output). The first paper linked above shows that given polynomial intermediate steps, transformers can handle exactly the class of problems that can be solved in polynomial time, aka P. So under those conditions they’re pretty strong and general (which certainly doesn’t mean that LLMs trained on next-token prediction are that strong and general).
One useful way to think about it, for me, is that auto-regression is a form of recursion, albeit one that’s bounded in current architecture by the context length.
Thanks for the references; I’ll need some time to review them. In the meanwhile, I’ll make some quick responses.
I intended tree search as just one example, since minimax tree search is a common example for game-based RL research.
In general, I agree. Though there are notable exceptions for cases such as (not mutually exclusive):
a closed form solution is found (for example, where a time-based simulation can calculate some quantity at an any arbitrary time step using the same amount of computation)
approximate solutions using a fixed number of computation steps are viable
a greedy algorithm can select the immediate next action that is equivalent to following a longer-term planning algorithm
Yes, like I said above, I agree in general and see your point.
As I’m confident we both know, some algorithms can be written more compactly when recursion/iteration are available. I don’t know how much computation theory touches on this; i.e. what classes of problems this applies to and why. I would make an intuitive guess that it is conceptually related to my point earlier about closed-form solutions.
Totally, good point!