This has been a consistent weakness of OpenAI’s image processing from the start: GPT-4V came with clear-cut warnings against using it on non-photographic inputs like screenshots or documents or tables, and sure enough, I found that it was wildly inaccurate on web page screenshots.
(In particular, I had been hoping to use it to automate Gwern.net regression detection: use a headless browser to screenshot random scroll positions on Gwern.net and report back if anything looked ‘wrong’. It seemed like exactly the sort of ‘I know it when I see it’ judgment task a VLM ought to be perfectly suited for. But when I tried it out, I discovered that GPT-4V basically couldn’t see even blatant errors like broken dropcaps, and running my script would have burned a lot of money to generate mostly just false positives/negatives.)
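The idea is simple enough that a short sketch conveys it (hypothetical code, not my actual script: it assumes Playwright for the headless browser and the OpenAI Python client, with an illustrative prompt and model name):

```python
# Hypothetical sketch (not the actual script): screenshot a random scroll
# position on a page with a headless browser and ask a VLM whether anything
# looks visually broken. Assumes `playwright` and `openai` are installed;
# the prompt and model name are illustrative.
import base64
import random

from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()

def check_random_viewport(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1024})
        page.goto(url)
        # Jump to a random offset so different runs sample different content.
        height = page.evaluate("document.body.scrollHeight")
        page.evaluate(f"window.scrollTo(0, {random.randint(0, max(0, height - 1024))})")
        png = page.screenshot()
        browser.close()
    b64 = base64.b64encode(png).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # the GPT-4V-era model name; illustrative
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this web page screenshot show any visual defects "
                         "(broken dropcaps, overlapping text, missing images)? "
                         "Answer OK or describe the defect."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(check_random_viewport("https://gwern.net/"))
```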
My guess is that the image datasets are so skewed towards photographs, and the de facto resolution so low, that GUIs/browsers/documents/tables/etc. just get turned into garbage. If you ever try turning a screenshot or PDF page into a common image input size like 224×224px (or even a generous 512×512px), you’ll notice that it often becomes impossible to read or understand in isolation, as a VLM is forced to do. The text labels become almost unreadable, and when they are readable, you have to puzzle over them for a while: exactly the sort of thing a cheap small VLM isn’t allowed to do.
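This is easy to check for yourself; a few lines of Pillow will show how little survives (assuming any full-page screenshot saved as ‘screenshot.png’):

```python
# Squash a full-page screenshot down to common VLM input resolutions
# (as naive square-resize preprocessing does) to see how much text survives.
# Assumes Pillow is installed and 'screenshot.png' is any page screenshot.
from PIL import Image

img = Image.open("screenshot.png")
for size in (224, 512):
    small = img.resize((size, size), Image.Resampling.LANCZOS)
    small.save(f"screenshot-{size}px.png")
    print(f"Wrote screenshot-{size}px.png; open it and try to read the labels.")
```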
This should be highly fixable using autoregressive multimodal LLMs given high-res image encodings and appropriate scale-ups (especially with improved BPE tokenization), but I guess it just hasn’t happened & been deployed at scale yet.
(You’d think they’re going to have to fix it soon, though, in order to make ‘agents’ work. There is no point in spending a lot of money on LLMs monkeying around on web pages as blind as they currently are.)
I’m going to remember that screenshot parsing is a weak point for ‘agents’.