Its image understanding on GeoGuessr shows some oddities, too. While it’s really good at understanding and recognizing lots of details or landmarks, there seems to be something odd happening when it tries to confirm its guess by searching for existing pictures.
For example, I asked it to geolocate this image (just by telling it “geolocate”, not using Kelsey’s big prompt):
Here was its final answer; its chain-of-thought mentions having searched for pictures of this:
That wooden bell‑tower is the one that stands beside Mortorp Church (Mortorps kyrka) in Kalmar Municipality, Småland, southern Sweden. Approximate coordinates: 56.588 °N, 16.084 °E. [Wikimedia Commons]
If we click on the Wikimedia link that it gave us, we get this picture:
… what? That does not look similar at all!
Looking for pictures that show the rest of the church doesn’t help, either:
(The original picture is from Finland, so at least it correctly landed on the Nordics, even if it’s in the wrong country.)
One reason for this (see here) may be that RLVR models would rather give an answer that is very unlikely to be correct than admit that they don’t know. (Though that doesn’t explain their inability to properly see the Pokémon game.)
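A back-of-the-envelope sketch of why that incentive could arise. The reward scheme here is purely my assumption, not something confirmed for any real training setup: a hypothetical verifier that pays 1 for a correct answer and 0 for "I don't know", with an optional penalty for wrong answers. The function name and the 2% figure are made up for illustration.

```python
def expected_reward(p_correct: float, abstain: bool, wrong_penalty: float = 0.0) -> float:
    """Expected reward under a hypothetical verifier (an assumption, not a real setup):
    +1 for a correct answer, -wrong_penalty for a wrong one, 0 for abstaining."""
    if abstain:
        return 0.0
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


# With no penalty for wrong answers, even a 2% long shot beats abstaining.
long_shot = expected_reward(0.02, abstain=False)
dont_know = expected_reward(0.02, abstain=True)
print(long_shot > dont_know)

# Once wrong answers cost something, the same long shot has negative expected reward.
print(expected_reward(0.02, abstain=False, wrong_penalty=0.5) < 0)
```

Under those assumptions, a confident wrong guess is never punished relative to abstaining, so guessing is always the better policy.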
I’d bet the webpage parser ignored the images and their contents.
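To make the guess concrete, here is a minimal sketch of a text-only page extractor built on Python's standard `html.parser`. I have no idea what the real browsing tool looks like; the class name, the helper, and the sample page are all hypothetical. The point is only that a parser like this hands the model the caption text and silently drops the picture itself.

```python
from html.parser import HTMLParser


class TextOnlyExtractor(HTMLParser):
    """Hypothetical extractor: keeps visible text, ignores <img> tags entirely."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skipped_images = 0

    def handle_starttag(self, tag, attrs):
        # The image is counted but never passed on; its pixels are not inspected.
        if tag == "img":
            self.skipped_images += 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def page_to_text(html: str):
    """Return (visible text, number of images dropped) for an HTML string."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks), parser.skipped_images


# Made-up page for illustration only.
SAMPLE = '<h1>Wooden bell tower</h1><img src="photo.jpg"><p>Photo of a church bell tower.</p>'

text, dropped = page_to_text(SAMPLE)
print(text)
print(dropped)
```

If the tool works anything like this, the model could "confirm" a match from the page title and caption alone without ever comparing the two pictures.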