What’s up with AI’s vision
This week I’ve read two interesting pieces of information:
Apparently current models (in particular o3) are really good at GeoGuessr (Testing AI’s GeoGuessr Genius),
Apparently they have a very hard time understanding what they see in screenshots from Pokémon Red, even when things are clearly marked (Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red).
This is puzzling to me. I find it much easier to find a staircase in Pokémon Red than to figure out where a photo was taken.
Reading about the first example from Scott’s post: I guess there are not that many featureless plains in the world, but can it really tell the types of grass growing there, yet have trouble identifying doors and staircases in Pokémon Red? Seems weird.
Do we know what’s up with all that?
Probably because the dataset of images + captions scraped from the internet consists of lots of boring photos with locations attributed to them, and not a lot of labeled screenshots of pixel art games by comparison. This is similar to how LLMs are very good at stylometry, because they have lots of experience making inferences about authors based on patterns in the text.
Another idea: real photos have lots of tiny details to notice regularities in. Pixel art images, on the other hand, can only be interpreted properly by “looking at the big picture”. AI vision is known to be biased towards textures rather than shape, compared to humans.
I don’t think it is specific to pixel art, I think it is more about general visual understanding, particularly when you have to figure out downstream consequences from the visual understanding (like “walk to here”).
Its image understanding on GeoGuessr shows some oddities, too. While it’s really good at understanding and recognizing lots of details or landmarks, there seems to be something odd happening when it tries to confirm its guess by searching for existing pictures.
For example, I asked it to geolocate this image (just by telling it “geolocate”, not using Kelsey’s big prompt):
Here was its final answer; its chain-of-thought mentions having searched for pictures of this:
If we click on the Wikimedia link that it gave us, we get this picture:
… what? That does not look similar at all!
Looking for pictures that show the rest of the church doesn’t help, either:
(The original picture is from Finland, so at least it correctly landed on the Nordics, even if it’s in the wrong country.)
One reason for this (see here) may be that RLVR models prefer giving an answer with very low probability of being correct to answering that they don’t know. (Though that doesn’t explain their inability to properly see the Pokémon game.)
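To spell out that incentive, here is a minimal back-of-the-envelope sketch (the 3% figure is made up, purely for illustration) of why a verifier that only rewards correct final answers makes a long-shot guess strictly better than saying “I don’t know”:

```python
# Illustrative only: if the verifier gives 1 for a correct final answer and 0
# otherwise, then "I don't know" can never score, so any nonzero-probability
# guess has higher expected reward.
p_correct_if_guessing = 0.03   # hypothetical chance a wild geolocation guess is right

reward_guess = p_correct_if_guessing * 1 + (1 - p_correct_if_guessing) * 0
reward_abstain = 0             # abstaining is simply scored as "not correct"

print(reward_guess, reward_abstain)   # 0.03 vs 0.0
print(reward_guess > reward_abstain)  # True: guessing dominates, however unlikely
```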
I’d bet the webpage parser ignored the images and their contents.
The boring hypothesis here is that the model was actually trained on the identify-location-from-picture task but not on the identify-object-location-in-pixel-art task, and that pixel art is surprisingly nontrivial for models to wrap their heads around when they’re still trying to understand real-world pictures.
I liked Gwern’s remarks at the end of your link:
I think even if the model wasn’t specifically trained for geolocation, it’s a reasonable assumption that metadata from photos (which often includes geo data) somehow gets passed to the models, creating a huge annotated dataset of (geo, photo) pairs during training for things like Google image search.
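To illustrate how cheap such pairs would be to harvest, here is a minimal sketch using Pillow’s EXIF support to read GPS coordinates off a photo; the filename is hypothetical, and whether any lab’s scraping pipeline actually does this is pure speculation on my part:

```python
from PIL import Image, ExifTags

def exif_gps(path):
    """Return (lat, lon) from a photo's EXIF GPS tags, or None if absent."""
    exif = Image.open(path).getexif()
    gps = exif.get_ifd(ExifTags.IFD.GPSInfo)  # GPS sub-IFD, empty dict if missing
    if not gps:
        return None

    def to_deg(dms, ref):
        # EXIF stores degrees/minutes/seconds as three rationals
        d, m, s = (float(x) for x in dms)
        deg = d + m / 60 + s / 3600
        return -deg if ref in ("S", "W") else deg

    lat = to_deg(gps[ExifTags.GPS.GPSLatitude], gps[ExifTags.GPS.GPSLatitudeRef])
    lon = to_deg(gps[ExifTags.GPS.GPSLongitude], gps[ExifTags.GPS.GPSLongitudeRef])
    return lat, lon

# Hypothetical scraped photo; a crawler could attach the result to the image
# to form a (geo, photo) training pair.
print(exif_gps("holiday_photo.jpg"))
```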
I wouldn’t necessarily expect this to be what’s going on, but just to check… are approximately-all the geoguessr images people try drawn from a single dataset on which the models might plausibly have been trained? Like, say, all the streetview images from google maps?
Apparently not. Scott wrote that he used one image from Google Maps and four personal images that are not available online.
People tried with personal photos too.
I tried with personal photos (screenshotted from Google Photos) and it worked pretty well too:
Identified the neighborhood in Lisbon where a picture was taken
Identified another picture as taken in Paris
Another one was identified as taken in a big Polish city; the correct answer was among the 4 candidates it listed
I didn’t use a long prompt like the one Scott copies in his post, just a short “You’re in GeoGuessr, where was this picture taken?” or something like that.
I have used tons of personal photos with Kelsey’s prompt, and it has been extremely successful (>75%, and it never gets it wrong if one of my friends can guess it too). I’m confident none of these photos are on the internet, and most aren’t even that similar to existing photos. Creepily enough, it’s not half bad at figuring out where people are indoors as well (not as good, but it got the neighborhood in Budapest I was in from a photo of a single room with some items on a table).
Nope, although it does have a much higher propensity to exhibit GeoGuessr behavior on pictures on or next to a road when given ambiguous prompts (initial post, slightly more rigorous analysis).
I think it’s possible (25%) that o3 was explicitly trained on exactly the GeoGuessr task, but more likely (40%) that it was trained on e.g. minimizing perplexity on image captions, for which knowing the exact location of the image is useful. It then managed to evoke the “GeoGuessr” behavior in its reasoning chain once, that behavior was strongly reinforced, and now it does it whenever it could plausibly be helpful.
My understanding is it’s not approximately all; it is literally all the images in GeoGuessr.
Do LLMs perform better at games that are later in the Pokemon series? If difficulty interpreting pixel art is what’s holding them back, it would be less of a problem when playing later Pokemon games with higher-resolution sprites.
There’s the thing where Gemini 2.5 Pro surprisingly isn’t very good at geoguessing; there’s a woman’s tweet to be linked <here>.
Do we know that it’s the recognising-content-of-images part of the task that is difficult? IIRC a couple of years ago someone made a GeoGuessr-style “find each other” game where two people could “load into” Google Maps and would try to meet up. I played it with a friend in a city we both knew. This seems like it might separate out (some of) the difference between the planning part of the task and the image-recognition part. I’ll try to find the game (IIRC it was a student project, maybe on itch.io, probably not on Steam).
Also note that there are people who can tutor you in GeoGuessr, but not in interpreting pixel art.
If even one blog post that goes through that process step by step ends up in the training data, and it is routinely a useful subtask in image tasks (what and where are correlated), then the sub-capability can be directly elicited.
Could it be that it’s able to rely on an internal model of the globe, which it had much more detailed GPS-tagged training data on, in a way it can’t for game worlds?