I think I might’ve been unclear. My understanding is:
An LLM can be fine-tuned to perform low-level robotic actions, like adjusting the position of a gripper, using sets of action tokens (as distinct from language or vision tokens).
OpenVLA might be a better example than RT-2, given that it’s explicitly fine-tuned from LLaMA in roughly the manner I described.
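The action-token idea can be sketched roughly like this. The binning scheme, token format, and ranges below are my own illustrative assumptions, not OpenVLA’s or RT-2’s actual tokenizer; the point is just that continuous actions get discretized into reserved tokens appended to the LLM’s vocabulary:

```python
# Illustrative sketch: extending an LLM's vocabulary with "action tokens"
# that discretize continuous robot commands. The binning scheme and token
# format here are hypothetical, not any published model's exact scheme.

def make_bin_centers(low: float, high: float, n_bins: int = 256) -> list:
    """Split a continuous action range into n_bins equal intervals."""
    width = (high - low) / n_bins
    return [low + width * (i + 0.5) for i in range(n_bins)]

def action_to_token(value: float, low: float, high: float, n_bins: int = 256) -> str:
    """Map a continuous value (e.g. a gripper x-delta in meters) to a reserved token."""
    clipped = max(low, min(high, value))
    idx = min(int((clipped - low) / (high - low) * n_bins), n_bins - 1)
    return f"<act_{idx}>"

def token_to_action(token: str, low: float, high: float, n_bins: int = 256) -> float:
    """Decode an action token back to the center of its bin."""
    idx = int(token.strip("<>").split("_")[1])
    return make_bin_centers(low, high, n_bins)[idx]

# Round trip: a 3 cm forward gripper move, in a ±10 cm action range
tok = action_to_token(0.03, -0.1, 0.1)
print(tok, round(token_to_action(tok, -0.1, 0.1), 4))  # small quantization error
```

Fine-tuning then consists of training the model to emit these tokens (alongside ordinary language tokens) given image and instruction inputs.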
The executor, in this experiment, is a hard-coded script that responds to commands like “rotate” or “go forward” with a fixed action.
I.e., there isn’t a neural network taking in “go forward” and outputting a series of motor actions. Instead, there’s something akin to an if-then statement that identifies “go forward” in the LLM’s output and engages the left and right motors at equal power for a predefined or LLM-specified duration.
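Something like the following, roughly. The function names, motor interface, and power values are hypothetical stand-ins for the actual script:

```python
# Hypothetical sketch of the hard-coded executor described above: no neural
# network, just string matching on the LLM's output mapped to fixed motor
# commands. The motor driver here is a stub that returns a description.

def set_motors(left: float, right: float, duration_s: float) -> str:
    """Stand-in for the real motor driver."""
    return f"L={left:.1f} R={right:.1f} for {duration_s:.1f}s"

def execute(llm_output: str, default_duration_s: float = 1.0) -> str:
    """If-then dispatch: find a known command anywhere in the LLM's output."""
    text = llm_output.lower()
    if "go forward" in text:
        # both motors at equal power, as described above
        return set_motors(1.0, 1.0, default_duration_s)
    elif "rotate" in text:
        # spin in place: motors at opposite power
        return set_motors(1.0, -1.0, default_duration_s)
    elif "stop" in text:
        return set_motors(0.0, 0.0, 0.0)
    raise ValueError(f"no known command in: {llm_output!r}")

print(execute("I should go forward to reach the door."))
```

A duration argument could equally be parsed out of the LLM’s output rather than fixed, per the “predefined or LLM-specified” point above.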
A VLA of the kind I was thinking of would be both the Orchestrator and the Executor: it would encapsulate the high-level ‘common sense’ knowledge it uses to solve problems, and also directly specify the low-level motor actions that carry out its tasks.
In practice, this would contrast with approaches that give the LLM a Python console with a specialized API designed to facilitate robotics tasks (VoxPoser, for instance, gives a ‘blind’ LLM an API that lets it use a vision-language model to guide its movements automatically), or, as I think you’re describing, pair a pure LLM or VLM tasked with high-level control with a VLA model tasked with implementing the strategies it proposes. The advantage here would be that there’s no information bottleneck between the planning and implementation modules, which I’ve noticed is the source of a decent share of failure cases in practical settings.
RT-2 (the paper you cited) is a VLA, not an LLM. VLAs are what the “executor” in our diagram uses.