o3 is scarily good at geo-guessing but can’t work out what’s what on a pixelated screen. That doesn’t make sense to me. Maybe a lack of pixelated training data?
Yeah it is confusing. You’d think there’s tons of available data on pixelated game screens. Maybe training on it somehow degrades performance on other images?
This has been a consistent weakness of OpenAI’s image processing from the start: GPT-4-V came with clear-cut warnings against using it on non-photographic inputs like screenshots, documents, or tables, and sure enough, I found that it was wildly inaccurate on web page screenshots.
(In particular, I had been hoping to use it to automate Gwern.net regression detection: use a headless browser to screenshot random points on Gwern.net and report back if anything looked ‘wrong’. It seemed like the sort of ‘I know it when I see it’ judgment task a VLM ought to be perfectly suited for. But when I tried it out, I discovered that GPT-4-V basically couldn’t see even blatant errors like broken dropcaps, and running my script would have burned a lot of money generating mostly false positives/negatives.)
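For concreteness, here is a minimal sketch of the kind of check I mean, assuming Playwright for the headless screenshots and the OpenAI chat API for the judgment call; the model name, prompt, and page list are illustrative placeholders, not the original script:

```python
# Hypothetical sketch of the regression-check idea: screenshot a page headlessly,
# send it to a vision model, and ask whether anything looks visually broken.
# Model name, prompt, and URLs are placeholders for illustration.
import base64
from playwright.sync_api import sync_playwright
from openai import OpenAI

client = OpenAI()
PAGES = ["https://gwern.net/", "https://gwern.net/about"]  # sample of pages to spot-check

def screenshot(url: str) -> bytes:
    """Render the page in headless Chromium and return a PNG screenshot."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1024})
        page.goto(url, wait_until="networkidle")
        png = page.screenshot(full_page=False)
        browser.close()
    return png

def looks_broken(png: bytes) -> str:
    """Ask a VLM for an 'I know it when I see it' judgment on the rendering."""
    b64 = base64.b64encode(png).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this web page render correctly? Look for broken "
                         "dropcaps, overlapping text, or missing styling. "
                         "Answer OK or BROKEN plus a one-line reason."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

for url in PAGES:
    print(url, "->", looks_broken(screenshot(url)))
```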
My guess is that the image datasets are so skewed towards photographs, and the de facto resolution so low, that GUIs/browsers/documents/tables/etc just get turned into garbage. If you ever try turning a screenshot or PDF page into a common image input size, like 224x224px (even a generous 512x512px), you’ll notice that often they become impossible to read or understand in isolation, like a VLM would be forced to. The text labels become almost unreadable, and when they are readable, you have to think about it hard for a while—exactly the sort of thing a cheap small VLM isn’t allowed to do.
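You can reproduce the effect in a couple of lines with Pillow (the file name is a placeholder for whatever browser capture you have on hand):

```python
# Downscale a screenshot to typical vision-encoder input sizes and see what survives.
from PIL import Image

shot = Image.open("screenshot.png")          # e.g. a 1280x1024 browser capture
for size in (224, 512):
    # Naive square resize, squashing aspect ratio much as simple preprocessing would.
    small = shot.resize((size, size), Image.LANCZOS)
    small.save(f"screenshot_{size}px.png")   # open these: most UI text is now illegible
```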
This should be highly fixable using autoregressive multimodal LLMs given high-res image encodings and appropriate scale-ups (especially with improved BPE tokenization), but I guess it just hasn’t happened & been deployed at scale yet.
(You’d think they’re going to have to fix it soon, though, in order to make ‘agents’ work. There is no point in spending a lot of money on LLMs monkeying around with web pages as blind as they are now.)
I’m going to remember that screenshot parsing is a weak point for ‘agents’.