I don’t think this is true at all. How do you translate, say, rotating multiple shapes in parallel into text?
At least for multimodal LLMs in the pure-token approach, like Gato or DALL-E 1 (and probably GPT-4o and Gemini, although few details have been published), you could do that by generating the tokens that encode an image (or video!) of several shapes, well, rotating in parallel. Then you just look at them.
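For concreteness, here's a minimal sketch of the pure-token idea, assuming a VQ-VAE-style setup. Everything in it (the codebook, grid size, and the stand-in `generate_image_tokens`) is hypothetical; real systems like DALL-E 1 learn the codebook with a trained dVAE rather than using random patches:

```python
# Toy sketch of the pure-token approach: an image is just a grid of
# discrete codebook indices, so a language model can "draw" by emitting
# those indices as ordinary tokens, which a decoder maps back to pixels.
import numpy as np

CODEBOOK_SIZE = 512   # number of distinct visual tokens
GRID = 8              # tokens per side; real models use e.g. 32x32
PATCH = 4             # pixels per side represented by one token

# A codebook mapping each token index to a small RGB patch.
# (Learned in reality; random here so the sketch is self-contained.)
rng = np.random.default_rng(0)
codebook = rng.integers(0, 256, size=(CODEBOOK_SIZE, PATCH, PATCH, 3),
                        dtype=np.uint8)

def generate_image_tokens(prompt: str, n_frames: int) -> np.ndarray:
    """Stand-in for the LLM: emit one GRID*GRID token sequence per frame.

    A real multimodal model would condition on the text prompt; here we
    just fabricate token ids so the decoding step below is runnable.
    """
    return rng.integers(0, CODEBOOK_SIZE, size=(n_frames, GRID * GRID))

def decode_tokens(tokens: np.ndarray) -> np.ndarray:
    """Map a flat token sequence back to a (GRID*PATCH, GRID*PATCH, 3) image."""
    patches = codebook[tokens.reshape(GRID, GRID)]   # (GRID, GRID, P, P, 3)
    return patches.transpose(0, 2, 1, 3, 4).reshape(
        GRID * PATCH, GRID * PATCH, 3)

# "Rotating multiple shapes in parallel" is then just one token sequence
# per frame: generate them, decode them, and look at the resulting video.
frames = [decode_tokens(t) for t in generate_image_tokens(
    "two squares rotating in opposite directions", n_frames=4)]
print(len(frames), frames[0].shape)   # 4 frames of 32x32 RGB pixels
```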