If you actually look at the number of bits of training data the human brain receives from birth to adulthood, a huge proportion of it is visual. So I’m not surprised that we’re comparatively good at 3D (and our nervous system very likely also has good inductive priors for it). I suspect the answer for LLMs is mostly just multimodal models trained on vast amounts of video — expensive, though the cost can be reduced somewhat by coming up with smarter ways to tokenize video.
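To make the tokenization point concrete, here's a toy back-of-envelope sketch (my own illustration, not anyone's actual tokenizer): grouping frames into spatiotemporal blocks, ViViT-style "tubelets", instead of tokenizing each frame independently, cuts the token count linearly in the temporal block size, and with it the quadratic attention cost. The function name and sizes below are made up for illustration.

```python
# Hypothetical sketch: token counts for per-frame 2D patches vs.
# spatiotemporal (tubelet) patches of a video.

def token_count(frames, height, width, patch=16, temporal=1):
    """Number of tokens when a video is split into
    (temporal x patch x patch) blocks. Sizes assumed evenly divisible."""
    return (frames // temporal) * (height // patch) * (width // patch)

# 4 seconds of 24 fps video at 224x224 resolution
per_frame = token_count(96, 224, 224, patch=16, temporal=1)  # one 14x14 grid per frame
tubelets = token_count(96, 224, 224, patch=16, temporal=4)   # 4-frame tubelets

print(per_frame, tubelets)  # 18816 vs 4704: a 4x reduction
```

And that's before any learned compression (e.g. a VQ-style latent codec), which is where most of the remaining savings would presumably come from.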