I agree frontier models severely lack spatial reasoning on images, which I attribute to a lack of in-depth spatial discussion of images on the internet. My model of frontier models’ vision capabilities is that they have very deep knowledge of the aspects of images that relate to the text immediately before or after them in web text, and only a very small fraction of images on the internet come with in-depth spatial discussion. The models are, for instance, very good at guessing where photos were taken, vastly better than most humans, because locations are often mentioned in the text around photos. I expect that if labs want to, they can construct enough semi-synthetic data to fix this.
I do agree that it looks like there has been a lack of data to address this ability. That being said, I’m pretty surprised at how terrible models are, and there’s a hierarchy of problems to be addressed here before models are actually useful in the physical world. Each step feels much more difficult than the step before, and all models are completely terrible at steps 2-4.
1. Simply look at a part and identify its features, whether it is symmetric, and so on. This requires basically no spatial reasoning ability, yet almost all models are completely terrible; even Gemini is very bad. I’m pretty surprised that this ability didn’t just fall out of scaling on data, but it does seem like it could be easily addressed with synthetic data (see the sketch after this list).
2. Have some basic spatial reasoning ability, so you can propose operations that are practical and aren’t physically impossible. This is much more challenging. First, it could be difficult to automatically generate practical solutions. Second, it may require moving beyond text chain of thought: when I walk through a setup, I don’t use language at all and just visualize everything.
3. Have an understanding of much of the tacit knowledge in machining, or rederive everything from first principles. Getting data could be especially challenging here.
4. Once you can create a single part correctly, propose multiple different ways to manufacture it, evaluate all of the plans, and choose the best combination of cost, simplicity, and speed. This is the part of the job that’s actually challenging.
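To make the synthetic-data point in step 1 concrete, here is a minimal sketch, entirely my own illustration and not anything labs are known to use: procedurally generate simple parts whose features and symmetry are known by construction, hand the geometry description to whatever renderer you like, and train on the resulting question/answer pairs. The plate-with-holes family and all names are placeholders.

```python
# Minimal sketch of synthetic Q/A generation for step 1 (feature / symmetry
# identification). Parts are simple rectangular plates with through-holes, so
# the hole layout gives exact ground truth for questions like "how many holes?"
# and "is the part mirror-symmetric?". The geometry dicts are assumed to be
# passed to a separate renderer (CAD kernel, Blender, etc.) for the images.
import random

def random_plate(rng: random.Random) -> dict:
    """Generate a plate with holes, optionally forcing mirror symmetry."""
    length, width = rng.uniform(40, 120), rng.uniform(20, 80)
    symmetric = rng.random() < 0.5
    holes = []
    for _ in range(rng.randint(1, 4)):
        x = rng.uniform(-length / 2 + 5, length / 2 - 5)
        y = rng.uniform(5, width / 2 - 5)              # upper half of the plate
        d = rng.choice([3.0, 5.0, 6.0, 8.0])
        holes.append((x, y, d))
        if symmetric:
            holes.append((x, -y, d))                   # mirror about the long axis
        else:
            holes.append((rng.uniform(-length / 2 + 5, length / 2 - 5),
                          rng.uniform(-width / 2 + 5, -5), d))
    return {"length": length, "width": width, "thickness": rng.uniform(3, 12),
            "holes": holes, "mirror_symmetric": symmetric}

def qa_pairs(part: dict) -> list:
    """Ground-truth questions a vision model should answer from the render."""
    return [
        ("How many through-holes does this part have?", str(len(part["holes"]))),
        ("Is the part mirror-symmetric about its long axis?",
         "yes" if part["mirror_symmetric"] else "no"),
        ("What is the largest hole diameter in mm?",
         f"{max(d for _, _, d in part['holes']):.1f}"),
    ]

rng = random.Random(0)
dataset = []
for _ in range(1000):
    part = random_plate(rng)
    dataset.append({"geometry": part, "qa": qa_pairs(part)})
```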
You mean something like using libraries of 3D models, maybe some kind of generative grammar for how to arrange them, and renderer feedback about what’s actually visible, to produce (geometry description, realistic rendering) pairs to train visuospatial understanding on?
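Concretely, a pipeline along those lines might look something like the sketch below. The asset names, the tiny scene “grammar”, and the renderer interface are all placeholders I’m making up for illustration; the key piece is that the caption is generated from the renderer’s visibility report, so it never describes occluded geometry.

```python
# Rough sketch of a (geometry description, rendering) pair generator.
# The renderer itself is not implemented here: a real pipeline would plug in
# Blender, a ray tracer, or a CAD kernel to produce the image and report
# which object indices are unoccluded from the chosen camera.
import random

# Placeholder asset names; a real pipeline would point at an actual model library.
ASSET_LIBRARY = ["bracket.stl", "bolt_m6.stl", "bearing_608.stl"]

def sample_scene(rng: random.Random) -> list:
    """Tiny 'generative grammar': pick a few parts and scatter them on a plane."""
    return [{
        "asset": rng.choice(ASSET_LIBRARY),
        "position": [round(rng.uniform(-0.2, 0.2), 3),
                     round(rng.uniform(-0.2, 0.2), 3), 0.0],
        "rotation_deg": rng.choice([0, 90, 180, 270]),
    } for _ in range(rng.randint(2, 5))]

def describe(scene: list, visible: set) -> str:
    """Spatial caption restricted to objects the renderer reports as visible."""
    parts = []
    for i, obj in enumerate(scene):
        if i not in visible:
            continue  # occluded objects are deliberately left out of the caption
        x, y, _ = obj["position"]
        parts.append(f"a {obj['asset']} at the {'near' if y < 0 else 'far'} "
                     f"{'left' if x < 0 else 'right'}, rotated {obj['rotation_deg']} degrees")
    return "The image shows " + "; ".join(parts) + "."

rng = random.Random(0)
scene = sample_scene(rng)
# A real renderer would produce the image and the set of unoccluded object
# indices here; for this sketch we pretend everything is visible.
visible = set(range(len(scene)))
print(describe(scene, visible))
```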