GAN Discriminators Don’t Generalize?

Disclaimer: I just started reading about GANs, so I’m almost certainly missing some context here.

Something that surprised me from the BigGAN paper:

We also observe that D’s loss approaches zero during training, but undergoes a sharp upward jump at collapse (Appendix F). One possible explanation for this behavior is that D is overfitting to the training set, memorizing training examples rather than learning some meaningful boundary between real and generated images. As a simple test for D’s memorization (related to Gulrajani et al. (2017)), we evaluate uncollapsed discriminators on the ImageNet training and validation sets, and measure what percentage of samples are classified as real or generated. While the training accuracy is consistently above 98%, the validation accuracy falls in the range of 50-55%, no better than random guessing (regardless of regularization strategy). This confirms that D is indeed memorizing the training set; we deem this in line with D’s role, which is not explicitly to generalize, but to distill the training data and provide a useful learning signal for G.
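
For concreteness, here is a minimal sketch of what that memorization test might look like. It assumes a trained discriminator that maps a batch of images to one real-vs-generated logit per image; the names `D`, `train_loader`, and `val_loader` are hypothetical, not from the paper’s code:

```python
import torch

@torch.no_grad()
def frac_classified_real(discriminator, loader, device="cuda"):
    """Fraction of images the discriminator classifies as real (logit > 0)."""
    n_real = n_total = 0
    for images, _labels in loader:
        logits = discriminator(images.to(device)).squeeze(-1)
        n_real += (logits > 0).sum().item()
        n_total += images.size(0)
    return n_real / n_total

# Hypothetical usage, mirroring the paper's check on real images:
# frac_classified_real(D, train_loader)  # paper reports above 98%
# frac_classified_real(D, val_loader)    # paper reports 50-55%
```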

I’m not sure how to interpret this result. The validation accuracy being close to 50% seems strange: if the discriminator has ‘memorized’ the training set and has only ever seen training-set images versus generated images, why wouldn’t it classify nearly everything in the validation set as generated, i.e. score close to 0% rather than 50%? Presumably validation images are both (1) not memorized and (2) not optimized to fool the discriminator the way generated images are. Maybe the post title is misleading, and we should instead think of this as “discriminators generalize surprisingly well despite also ‘memorizing’ the training data.” (EDIT: See comment thread here for clarification)

Note that the discriminator has far fewer parameters than there are bytes to memorize, so it must be performing some sort of (lossy) compression to do well on the training set. Could we think of the generator as succeeding by exploiting patterns in the discriminator’s compression, which the discriminator then works to obfuscate? I would expect more obfuscation to put additional demands on the discriminator’s capacity. Maybe good generator performance then comes from defeating simpler compression schemes, and it so happens that simple compression schemes are exactly what our visual system and GAN metrics measure.
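
To put rough numbers on the capacity point (all approximate: ~1.28M ImageNet-1k training images, stored uncompressed at BigGAN’s 128x128 resolution, and a discriminator on the order of 10^8 float32 parameters):

```python
# Back-of-envelope: raw training-set bytes vs. discriminator parameter bytes.
num_images = 1_281_167            # ImageNet-1k training set (approximate)
bytes_per_image = 128 * 128 * 3   # uncompressed 8-bit RGB at 128x128
dataset_gb = num_images * bytes_per_image / 1e9

disc_params = 1e8                 # order-of-magnitude guess for D
param_gb = disc_params * 4 / 1e9  # 4 bytes per float32 parameter

print(f"dataset:       ~{dataset_gb:.0f} GB")           # ~63 GB
print(f"discriminator: ~{param_gb:.1f} GB")             # ~0.4 GB
print(f"ratio:         ~{dataset_gb / param_gb:.0f}x")  # ~157x
```

So even ignoring everything else the discriminator has to do, there are roughly two orders of magnitude more raw pixel bytes than parameter bytes.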

Does this indicate that datasets are still just too small? Later in the same paper, they train on the much larger JFT-300M dataset (as opposed to ImageNet above) and mention:

Interestingly, unlike models trained on ImageNet, where training tends to collapse without heavy regularization (Section 4), the models trained on JFT-300M remain stable over many hundreds of thousands of iterations. This suggests that moving beyond ImageNet to larger datasets may partially alleviate GAN stability issues.
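
If the capacity arithmetic above is roughly right, memorization looks far less feasible at this scale (assuming ~300M images at the same 128x128 resolution):

```python
# Same back-of-envelope for JFT-300M (assuming ~300M images, 128x128 RGB):
jft_tb = 300_000_000 * 128 * 128 * 3 / 1e12
print(f"JFT-300M: ~{jft_tb:.0f} TB raw")  # ~15 TB, ~230x ImageNet-1k
```

At that size the raw data dwarfs any plausible discriminator capacity, so if memorization is what destabilizes training, the stability they observe is at least consistent with that story.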

They don’t mention whether this also improves discriminator generalization (validation accuracy) or reduces training-set accuracy, both of which I’d be interested to know. I’d also be interested in connecting this story to mode collapse somehow.