I liked Gwern’s remarks at the end of your link:

The boring hypothesis here is that the model was actually trained on the id-location-from-picture task but was not trained on the id-object-location-in-pixel-art task, and that pixel art is surprisingly nontrivial for models to wrap their heads around when they’re still trying to understand real-world pictures.
Successful applications to pixel art tend to inject real-world knowledge, for example through models pretrained on FFHQ, or to focus on tasks that involve ‘throwing away’ information rather than generating it, such as style transfer into pixel-art styles.
Thus, if I wanted to make a Pokemon GAN, I would not attempt to train solely on pixel art scraped from a few games. I would instead start with a large dataset of animals, real or fictional, perhaps from ImageNet, iNaturalist, or Wikipedia, and grab all Pokemon art of any kind from anywhere, including dumping individual frames from the Pokemon anime and exploiting CGI models of animals/Pokemon to densely sample all possible images. The focus would be on generating as high-quality and diverse a distribution of fantastic beasts as possible; once that succeeded, ‘Pokemon-style pixelization’ would be treated as a second phase, applied to the high-quality, high-resolution photographic fantasy animals generated by the first model.
(This is why I was pushing in Tensorfork for training a single big BigGAN on all the datasets we had: I knew that a single universal model would beat all of the specialized GANs everyone was working on, and would also likely unlock capabilities that simply could not be trained in isolation, like Pokemon.)
It is noteworthy that the first really good pixel-art neural generative model, CLIPIT PixelDraw (and later pixray), relies entirely on the pretrained CLIP model, which was trained on n = 400M Internet images. Similarly, Projected GAN’s Pokemon samples work, but only because the model is drawing on ImageNet knowledge through its enriched pretrained features.
Neural networks show us that sometimes the hard things are easy & the easy things hard, because we don’t understand how we think or learn.
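To make the two-phase approach described above concrete, here is a minimal sketch, not Gwern’s actual pipeline, of what a ‘Pokemon-style pixelization’ second phase could look like, assuming Pillow is installed and a high-resolution render from the first-stage model already exists; the sprite size, palette size, and file names are illustrative assumptions.

```python
# Minimal sketch of a "pixelization" second phase: take a high-resolution
# render from a first-stage generative model and turn it into pixel art
# by downsampling and quantizing to a small color palette.
from PIL import Image

def pixelize(img: Image.Image, sprite_size: int = 64, n_colors: int = 16) -> Image.Image:
    """Convert a high-resolution image into a low-res, limited-palette sprite."""
    # Downsample: discards high-frequency detail, as pixel art does.
    small = img.convert("RGB").resize((sprite_size, sprite_size), Image.Resampling.BOX)
    # Quantize to a small palette, the other defining constraint of pixel art.
    small = small.quantize(colors=n_colors)
    # Upscale with nearest-neighbor so the blocky pixels stay visible.
    return small.convert("RGB").resize(
        (sprite_size * 8, sprite_size * 8), Image.Resampling.NEAREST
    )

if __name__ == "__main__":
    # "fantasy_animal.png" stands in for an output of the first-stage model.
    pixelize(Image.open("fantasy_animal.png")).save("fantasy_animal_sprite.png")
```

The point of splitting the problem this way is that the second phase only throws information away, which is exactly the kind of pixel-art task described above as tractable.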
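The observation that CLIPIT PixelDraw / pixray lean entirely on pretrained CLIP can also be made concrete. The sketch below is not the pixray codebase, just a bare-bones illustration of the underlying idea, assuming PyTorch and OpenAI’s `clip` package are installed; the prompt and hyperparameters are placeholders. A small grid of pixels is optimized directly so that its blocky upscale matches a text prompt under CLIP, and every bit of ‘knowledge’ about what the prompt should look like comes from the frozen CLIP model trained on those 400M images.

```python
# Bare-bones CLIP-guided pixel-art optimization (the idea behind
# CLIPDraw/PixelDraw-style generators, not their actual code).
# Assumes PyTorch and OpenAI's CLIP package are installed.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device, jit=False)
model = model.float().eval()
for p in model.parameters():       # CLIP stays frozen; only the pixels are learned
    p.requires_grad_(False)

prompt = "a pixel art sprite of a small yellow electric rodent"   # placeholder prompt
text_features = F.normalize(model.encode_text(clip.tokenize([prompt]).to(device)), dim=-1)

# The entire "generator" is a 32x32 grid of RGB logits.
pixels = torch.randn(1, 3, 32, 32, device=device, requires_grad=True)
opt = torch.optim.Adam([pixels], lr=0.05)

# CLIP's input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(300):
    img = torch.sigmoid(pixels)                         # map logits into [0, 1]
    big = F.interpolate(img, size=224, mode="nearest")  # blocky upscale to CLIP's input size
    image_features = F.normalize(model.encode_image((big - mean) / std), dim=-1)
    loss = -(image_features * text_features).sum()      # maximize cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()

# torch.sigmoid(pixels) now holds the optimized 32x32 pixel-art image.
```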
I think that even if the model wasn’t specifically trained for geolocation, it’s reasonable to assume that metadata from photos (which often includes geodata) somehow gets passed along to the models, so that the images collected for things like Google image search effectively formed a huge annotated dataset of (geo, photo) pairs during training.
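A minimal sketch of the mechanism being suggested here, assuming Pillow and a folder of ordinary JPEGs (the `photos/` directory and helper names are placeholders): many photos carry EXIF GPS tags, so pairing each image with its embedded coordinates yields (geo, photo) pairs with no manual labeling at all.

```python
# Sketch: harvesting (geo, photo) pairs from EXIF metadata with Pillow.
# Any photo geotagged by a phone camera already carries its own label.
from pathlib import Path
from PIL import Image

GPS_IFD = 0x8825  # EXIF pointer to the GPS sub-directory

def dms_to_degrees(dms, ref):
    """Convert EXIF (degrees, minutes, seconds) rationals into signed decimal degrees."""
    deg = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
    return -deg if ref in ("S", "W") else deg

def extract_geo(path):
    """Return (latitude, longitude) if the image has GPS EXIF tags, else None."""
    gps = Image.open(path).getexif().get_ifd(GPS_IFD)
    if not gps or 2 not in gps or 4 not in gps:
        return None
    lat = dms_to_degrees(gps[2], gps.get(1, "N"))  # tags 1/2: GPSLatitudeRef / GPSLatitude
    lon = dms_to_degrees(gps[4], gps.get(3, "E"))  # tags 3/4: GPSLongitudeRef / GPSLongitude
    return lat, lon

# Build the annotated pairs; "photos/" is a placeholder directory.
pairs = [(geo, p) for p in Path("photos").glob("*.jpg") if (geo := extract_geo(p)) is not None]
print(f"{len(pairs)} geotagged photos found")
```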