What’s up with AI’s vision
This week I’ve read two interesting pieces of information:
Apparently current models (in particular o3) are really good at GeoGuessr (Testing AI’s GeoGuessr Genius),
Apparently they have a very hard time understanding what they see in screenshots from Pokémon Red, even when things are clearly marked (Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red).
This is puzzling to me. I find it much easier to find a staircase in Pokémon Red than to figure out where a photo was taken.
Reading about the first example from Scott’s post: I guess there are not that many featureless plains in the world, but can it really tell the types of grass growing there, yet have trouble identifying doors and staircases in Pokémon Red? Seems weird.
Do we know what’s up with all that?
Probably because the dataset of images + captions scraped from the internet consists of lots of boring photos with locations attributed to them, and not a lot of labeled screenshots of pixel art games by comparison. This is similar to how LLMs are very good at stylometry, because they have lots of experience making inferences about authors based on patterns in the text.
Another idea: real photos have lots of tiny details to notice regularities in. Pixel art images, on the other hand, can only be interpreted properly by “looking at the big picture”. AI vision is known to be biased towards textures rather than shape, compared to humans.
I don’t think it is specific to pixel art, I think it is more about general visual understanding, particularly when you have to figure out downstream consequences from the visual understanding (like “walk to here”).
Its image understanding on GeoGuessr shows some oddities, too. While it’s really good at understanding and recognizing lots of details or landmarks, there seems to be something odd happening when it tries to confirm its guess by searching for existing pictures.
For example, I asked it to geolocate this image (just by telling it “geolocate”, not using Kelsey’s big prompt):
Here was its final answer; its chain-of-thought mentions having searched for pictures of this:
If we click on the Wikimedia link that it gave us, we get this picture:
… what? That does not look similar at all!
Looking for pictures that show the rest of the church doesn’t help, either:
(The original picture is from Finland, so at least it correctly landed on the Nordics, even if it’s in the wrong country.)
One reason for this (see here) may be that RLVR models prefer giving an answer with very low probability of being correct to answering that they don’t know. (Though that doesn’t explain their inability to properly see the Pokémon game.)
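To spell out that incentive, here is a minimal back-of-the-envelope sketch (the 3% figure is made up, purely for illustration) of why a verifier that only rewards correct final answers makes a long-shot guess strictly better than saying “I don’t know”:

```python
# Illustrative only: if the verifier gives 1 for a correct final answer and 0
# otherwise, then "I don't know" can never score, so any nonzero-probability
# guess has higher expected reward.
p_correct_if_guessing = 0.03   # hypothetical chance a wild geolocation guess is right

reward_guess = p_correct_if_guessing * 1 + (1 - p_correct_if_guessing) * 0
reward_abstain = 0             # abstaining is simply scored as "not correct"

print(reward_guess, reward_abstain)   # 0.03 vs 0.0
print(reward_guess > reward_abstain)  # True: guessing dominates, however unlikely
```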
I’d bet the webpage parser ignored the images and their contents.
The boring hypothesis here is that the model was actually trained on the identify-location-from-picture task but not on the identify-object-location-in-pixel-art task, and that pixel art is surprisingly nontrivial for models to wrap their heads around when they’re still trying to understand real-world pictures.
I liked Gwern’s remarks at the end of your link:
I think even if the model wasn’t specifically trained for geolocation, it’s a reasonable assumption that metadata from photos (which often includes geo data) somehow gets passed to the models, creating a huge annotated dataset of (geo, photo) pairs during training for things like Google image search.
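To illustrate how cheap such pairs would be to harvest, here is a minimal sketch using Pillow’s EXIF support to read GPS coordinates off a photo; the filename is hypothetical, and whether any lab’s scraping pipeline actually does this is pure speculation on my part:

```python
from PIL import Image, ExifTags

def exif_gps(path):
    """Return (lat, lon) from a photo's EXIF GPS tags, or None if absent."""
    exif = Image.open(path).getexif()
    gps = exif.get_ifd(ExifTags.IFD.GPSInfo)  # GPS sub-IFD, empty dict if missing
    if not gps:
        return None

    def to_deg(dms, ref):
        # EXIF stores degrees/minutes/seconds as three rationals
        d, m, s = (float(x) for x in dms)
        deg = d + m / 60 + s / 3600
        return -deg if ref in ("S", "W") else deg

    lat = to_deg(gps[ExifTags.GPS.GPSLatitude], gps[ExifTags.GPS.GPSLatitudeRef])
    lon = to_deg(gps[ExifTags.GPS.GPSLongitude], gps[ExifTags.GPS.GPSLongitudeRef])
    return lat, lon

# Hypothetical scraped photo; a crawler could attach the result to the image
# to form a (geo, photo) training pair.
print(exif_gps("holiday_photo.jpg"))
```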
I wouldn’t necessarily expect this to be what’s going on, but just to check… are approximately-all the geoguessr images people try drawn from a single dataset on which the models might plausibly have been trained? Like, say, all the streetview images from google maps?
Apparently not. Scott wrote that he used one image from Google Maps and four personal images that are not available online.
People tried with personal photos too.
I tried with personal photos (screenshotted from Google Photos) and it worked pretty well too:
Identified the neighborhood in Lisbon where a picture was taken
Identified another picture as taken in Paris
Another one was identified as taken in a big Polish city; the correct answer was among the 4 candidates it listed
I didn’t use a long prompt like the one Scott copies in his post, just a short “You’re in GeoGuessr, where was this picture taken?” or something like that.
I have used tons of personal photos with Kelsey’s prompt, and it has been extremely successful (>75%, and it never gets it wrong if one of my friends can guess it too). I’m confident none of these photos are on the internet, and most aren’t even that similar to existing photos. Creepily enough, it’s not half bad at figuring out where people are indoors as well (not as good, but it got the neighborhood in Budapest I was in from a photo of a single room with some items on a table).
Nope, although it does have a much higher propensity to exhibit GeoGuessr behavior on pictures on or next to a road when given ambiguous prompts (initial post, slightly more rigorous analysis).
I think it’s possible (25%) that o3 was explicitly trained on exactly the GeoGuessr task, but more likely (40%) that it was trained on e.g. minimizing perplexity on image captions, for which knowing the exact location of the image is useful. It then managed to evoke the “GeoGuessr” behavior in its reasoning chain once, that behavior was strongly reinforced, and now it does it whenever it could plausibly be helpful.
My understanding is it’s not approximately all; it is literally all the images in GeoGuessr.
Do LLMs perform better at games that are later in the Pokemon series? If difficulty interpreting pixel art is what’s holding them back, it would be less of a problem when playing later Pokemon games with higher-resolution sprites.
There’s the thing where Gemini 2.5 Pro surprisingly isn’t very good at geoguessing; there’s a woman’s tweet to be linked <here>.
Do we know that it’s the recognising-content-of-images part of the task that is difficult? IIRC a couple of years ago someone made a GeoGuessr-style “find each other” game where two people could “load into” Google Maps and would try to meet up. I played it with a friend in a city we both knew. This seems like it might separate out (some of) the difference between the planning part of the task and the image-recognition part. I’ll try to find the game (IIRC it was a student project, maybe on itch.io, probably not on Steam).
Also note that there are people who can tutor you in GeoGuessr, but not in interpreting pixel art.
If even one blog post that goes through that process step by step ends up in the training data, and it is routinely a useful subtask in image tasks (what and where are correlated), then the sub-capability can be directly elicited.
Could it be that it’s able to rely on an internal model of the globe, which it had much more detailed GPS-tagged training data on, in a way it can’t for game worlds?