o3 is scarily good at geo-guessing but can’t work out what’s what on a pixelated screen. That doesn’t make sense to me. Maybe a lack of pixelated training data?
Yeah it is confusing. You’d think there’s tons of available data on pixelated game screens. Maybe training on it somehow degrades performance on other images?
This has been a consistent weakness of OpenAI’s image processing from the start: GPT-4-V came with clear-cut warnings against using it on non-photographic inputs like screenshots, documents, or tables, and sure enough, I found that it was wildly inaccurate on web page screenshots.
(In particular, I had been hoping to use it to automate Gwern.net regression detection: use a headless browser to screenshot random points on Gwern.net and report back if anything looked ‘wrong’. It seemed like the sort of ‘I know it when I see it’ judgment task a VLM ought to be perfectly suited for. But when I tried it out, I discovered that GPT-4-V basically couldn’t see even blatant errors like broken dropcaps, and running my script would have burned a lot of money generating mostly false positives/negatives.)
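For concreteness, here is a minimal sketch of the kind of check I mean, assuming Playwright for the headless screenshots and the OpenAI chat API for the judgment call; the model name, prompt, and page list are illustrative placeholders, not the original script:

```python
# Hypothetical sketch of the regression-check idea: screenshot a page headlessly,
# send it to a vision model, and ask whether anything looks visually broken.
# Model name, prompt, and URLs are placeholders for illustration.
import base64
from playwright.sync_api import sync_playwright
from openai import OpenAI

client = OpenAI()
PAGES = ["https://gwern.net/", "https://gwern.net/about"]  # sample of pages to spot-check

def screenshot(url: str) -> bytes:
    """Render the page in headless Chromium and return a PNG screenshot."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1024})
        page.goto(url, wait_until="networkidle")
        png = page.screenshot(full_page=False)
        browser.close()
    return png

def looks_broken(png: bytes) -> str:
    """Ask a VLM for an 'I know it when I see it' judgment on the rendering."""
    b64 = base64.b64encode(png).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this web page render correctly? Look for broken "
                         "dropcaps, overlapping text, or missing styling. "
                         "Answer OK or BROKEN plus a one-line reason."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

for url in PAGES:
    print(url, "->", looks_broken(screenshot(url)))
```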
My guess is that the image datasets are so skewed towards photographs, and the de facto resolution so low, that GUIs/browsers/documents/tables/etc just get turned into garbage. If you ever try turning a screenshot or PDF page into a common image input size, like 224x224px (even a generous 512x512px), you’ll notice that often they become impossible to read or understand in isolation, like a VLM would be forced to. The text labels become almost unreadable, and when they are readable, you have to think about it hard for a while—exactly the sort of thing a cheap small VLM isn’t allowed to do.
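You can reproduce the effect in a couple of lines with Pillow (the file name is a placeholder for whatever browser capture you have on hand):

```python
# Downscale a screenshot to typical vision-encoder input sizes and see what survives.
from PIL import Image

shot = Image.open("screenshot.png")          # e.g. a 1280x1024 browser capture
for size in (224, 512):
    # Naive square resize, squashing aspect ratio much as simple preprocessing would.
    small = shot.resize((size, size), Image.LANCZOS)
    small.save(f"screenshot_{size}px.png")   # open these: most UI text is now illegible
```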
This should be highly fixable using autoregressive multimodal LLMs given high-res image encodings and appropriate scale-ups (especially with improved BPE tokenization), but I guess it just hasn’t happened & been deployed at scale yet.
(You’d think they’re going to have to fix it soon, though, in order to make ‘agents’ work. There is no point in spending a lot of money on LLMs monkeying around with web pages as blind as they are now.)
I’m going to remember that screenshot parsing is a weak point for ‘agents’.