Successful applications to pixel art tend to inject real-world knowledge, such as through models pretrained on FFHQ, or to focus on tasks that involve ‘throwing away’ information rather than generating it, such as style transfer into pixel-art styles.
Thus, if I wanted to make a Pokemon GAN, I would not attempt to train solely on pixel art scraped from a few games. I would instead start with a large dataset of animals, real or fictional, perhaps from ImageNet or iNaturalist or Wikipedia, and grab all Pokemon art of any kind from anywhere, including dumping individual frames from the Pokemon anime and exploiting CGI models of animals/Pokemon to densely sample all possible images; I would focus on generating as high-quality and diverse a distribution of fantastic beasts as possible, and when that succeeded, treat ‘Pokemon-style pixelization’ as a second phase, applied to the high-quality, high-resolution photographic fantasy animals generated by the first model.
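That second phase is the easy, information-destroying one. As a minimal sketch (assuming Pillow; the filename, grid size, and palette size are chosen purely for illustration), naive pixelization is just downsampling plus palette quantization applied to whatever the first-phase model generates:

```python
# Minimal sketch of the 'second phase': naive Pokemon-style pixelization of a
# high-resolution generated image by throwing away spatial and color detail.
# The filename, grid size, and palette size are illustrative assumptions.
from PIL import Image

def pixelize(img: Image.Image, grid: int = 64, colors: int = 16) -> Image.Image:
    small = img.resize((grid, grid), Image.Resampling.BILINEAR)  # discard spatial detail
    small = small.quantize(colors=colors)                        # discard color detail
    return small.convert("RGB").resize(img.size, Image.Resampling.NEAREST)  # crisp blocks

if __name__ == "__main__":
    # 'fantasy_animal.png' stands in for an output of the hypothetical first-phase model.
    hi_res = Image.open("fantasy_animal.png").convert("RGB")
    pixelize(hi_res).save("fantasy_animal_pixel.png")
```

A real Pokemon-style pixelizer would want learned palettes, outlines, and dithering rather than this crude quantization, but the direction of the transform is the point: it only discards information.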
(This is why I was pushing within Tensorfork for training a single big BigGAN on all the datasets we had: I knew that a single universal model would beat all of the specialized GANs everyone was doing, and would also likely unlock capabilities that simply could not be trained in isolation, like Pokemon.)
It is noteworthy that the first really good pixel art neural generative model, CLIPIT PixelDraw (and later pixray), relies entirely on the pretrained CLIP model, which was trained on n = 400m Internet images. Similarly, Projected GAN’s Pokemon results work, but only because it is drawing on ImageNet knowledge through the enriched features of its pretrained network.
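To see how little the ‘generator’ itself contributes in that kind of setup, here is a rough sketch of CLIP-guided pixel-art optimization (not the actual PixelDraw/pixray code; the prompt, grid size, step count, and learning rate are arbitrary assumptions), where a frozen pretrained CLIP supplies all the knowledge and the learnable parameters are nothing but a 32×32 grid of RGB logits:

```python
# Rough sketch of CLIP-guided pixel-art optimization in the spirit of CLIPIT
# PixelDraw: a frozen, pretrained CLIP scores a tiny learnable pixel grid against
# a text prompt, and gradient ascent on that score does all the 'generation'.
# Prompt, grid size, step count, and learning rate are arbitrary assumptions.
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device, jit=False)
model = model.eval().float()
for p in model.parameters():
    p.requires_grad_(False)  # CLIP stays frozen; only the pixel grid is optimized

# CLIP's input normalization statistics.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

# The entire 'generator' is a 32x32 grid of RGB logits.
pixels = torch.zeros(1, 3, 32, 32, device=device, requires_grad=True)

text = clip.tokenize(["a pixel art sprite of a fire-breathing dragon"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(text)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

opt = torch.optim.Adam([pixels], lr=0.05)
for step in range(500):
    img = torch.sigmoid(pixels)                          # squash logits into [0, 1] RGB
    big = F.interpolate(img, size=224, mode="nearest")   # blow up to CLIP's input size
    img_emb = model.encode_image((big - mean) / std)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = -(img_emb * text_emb).sum()                   # maximize cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Whatever ‘Pokemon-ness’ emerges comes out of CLIP’s 400m-image pretraining, not out of the ~3,000 pixel parameters being optimized.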
Neural networks show us that sometimes the hard things are easy & the easy things hard, because we don’t understand how we think or learn.
I liked Gwern’s remarks at the end of your link: