I don’t think image understanding is the bottleneck. O3 and O4-mini-high seem like a meaningful improvement in vision, almost good enough for this part, but they still fail miserably at the physical reasoning.
This person got O4-mini-high to generate a reasonably close image of the part:
https://x.com/tombielecki/status/1912913806541693253
Interesting that you found this to be the case! I recently wrote a post about evaluating LLMs on a basic manufacturing task and came to the same conclusion. It’s always a bit jarring for me to go from the text / code domain, where the LLMs feel so competent, to the vision domain, where I start to feel like Gary Marcus because the LLMs are so bad.
Relevant quote from my post:
“Most Models Have Truly Horrible Visual Abilities: For two years, I’ve observed essentially zero improvement in visual capabilities among models from Anthropic and OpenAI. They always miss obvious features like the flats cut into the round surface or the holes, or they hallucinate nonexistent features such as holes drilled along the part’s length. I have never seen Claude 3.5, Claude 3.7 (thinking and non-thinking), GPT-4.5, GPT-4o, or O1-Pro produce a reasonable description of the part. Without vision abilities, creating a manufacturing plan is completely hopeless.
Interestingly, many of these models also score at or above the level of some human experts on visual reasoning benchmarks like MMMU. That which is easy to measure often doesn’t correlate with real-world usefulness.”
Note that Gemini 2.5 Pro and O3 are both significant improvements in vision on this particular eval.
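If you want to try a rough version of this kind of check yourself, here is a minimal sketch, assuming the OpenAI Python SDK with an API key in the environment. The file name, prompt wording, and model choice are placeholders of mine, not the actual eval from the post:

    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Load a photo (or CAD render) of the machined part as base64.
    # "part_photo.png" is a placeholder file name.
    with open("part_photo.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # swap in whichever vision-capable model you want to test
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Describe this machined part for a manufacturing plan. "
                            "List every visible feature (flats, holes, threads, "
                            "chamfers) and rough proportions. Do not invent "
                            "features you cannot see."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )

    # Compare the returned description against the part's actual features
    # (missed flats/holes, hallucinated features, etc.).
    print(response.choices[0].message.content)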