LLMs are not trained to be robots, and they will most likely never be tasked with low-level controls in robotics (i.e. generating long sequences of numbers for gripper positions and joint angles).
Perhaps I’m off-base, since robotics isn’t my area, but I have read some papers that indicate that this is viable. This one in particular is pretty well-cited. I’ve always suspected that scaling up training on video data of humans performing tasks, with some creative preprocessing to properly label the movements therein, could get us something that outputs a series of limb/finger positions that would fulfill a given natural language request. From there, we’d need some output postprocessing, of the kind we’ve seen inklings of in numerous other LLM+robotics papers, to get a roughly humanoid robot to match those movements.
The above is certainly simplified, and I suspect that much more intensive image preprocessing involving object detection and labelling would be necessary, but I do think that you could get a proof of concept along these lines with 2025 tech if you picked a relatively simple class of task.