But for me it is also one of those examples that make me sad about the outlook for AI and for Computer Vision. What would it take for a computer to understand this image as you or I do? [...]
In any case, we are very, very far and this depresses me. What is the way forward? :(
I just asked gpt-4o what’s going on in the picture, and it understood most of it:
In this image, a group of men in business attire are seen in a locker room or a similar setting. The focus is on two men, where the taller man is standing on a scale. The shorter man, who appears to be playfully pressing down on the taller man’s shoulders to increase his weight on the scale, is creating a humorous situation. Both men and those observing in the background are smiling or laughing, indicating that they are enjoying the lighthearted moment. The man pressing down seems to be taking part in a playful prank or joke, adding a sense of camaraderie and fun to the scene.
12 years ago, in "The state of Computer Vision and AI: we are really, really far away", Andrej Karpathy wrote:
Of course, Karpathy’s post could be in the multimodal training data.