Interesting that you found this to be the case! I recently wrote a post about evaluating LLMs on a basic manufacturing task, and I reached the same conclusion. It’s always a bit jarring for me to go from the text / code domain, where the LLMs feel so competent, to the vision domain, where I start to feel like Gary Marcus because the LLMs are so bad.
Relevant quote from my post:
“Most Models Have Truly Horrible Visual Abilities: For two years, I’ve observed essentially zero improvement in visual capabilities among models from Anthropic and OpenAI. They always miss obvious features like the flats cut into the round surface, holes, or even hallucinate nonexistent features such as holes drilled along the part’s length. I have never seen Claude 3.5, Claude 3.7 (thinking and non-thinking), GPT-4.5, GPT-4o, or O1-Pro produce a reasonable description of the part. Without vision abilities, creating a manufacturing plan is completely hopeless.
Interestingly, many of these models also score at or above the level of some human experts on visual reasoning benchmarks like MMMU. That which is easy to measure often doesn’t correlate with real world usefulness.”
Note that Gemini 2.5 Pro and O3 are both a significant improvement in vision on this particular eval.