Yes, I understand this point. I was saying that we’d expect it to get 0% if its algorithm is “guess yes for anything in the training set and no for anything outside of it”.
It continues to be surprising (to me) even granting that it’s trying to follow that algorithm but can’t do so exactly. Presumably the generator is able to emulate whatever features the discriminator uses for inexactly matching the training set. If those features amounted to “looks like something from the training/test distribution”, we’d expect it to call close to 100% of the test set real. If those features were highly specific to the training set, we’d expect it to get closer to 0% on the test set (since the model should reject anything without those features). Instead it gets ~50%, which means whatever it’s looking for is essentially uncorrelated with what the test data looks like, yet present in about half of the examples; that still seems surprising to me.
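(For concreteness, here’s the kind of measurement I have in mind, sketched with a placeholder discriminator and random tensors standing in for the train/test batches; none of the specifics below are from the actual experiment being discussed.)

```python
import torch
import torch.nn as nn

# Placeholder stand-in for a trained GAN discriminator; in practice you'd
# load the actual trained network here.
D = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 1), nn.Sigmoid())

def fraction_called_real(discriminator, images, threshold=0.5):
    """Fraction of images the discriminator scores above the 'real' threshold."""
    with torch.no_grad():
        scores = discriminator(images).squeeze(-1)
    return (scores > threshold).float().mean().item()

# Hypothetical batches standing in for the training set and a held-out test set.
train_batch = torch.rand(256, 1, 28, 28)
test_batch = torch.rand(256, 1, 28, 28)

# "Memorize the training set" predicts ~100% on train and ~0% on test;
# the puzzling observation is ~100% on train but ~50% on test.
print("train:", fraction_called_real(D, train_batch))
print("test: ", fraction_called_real(D, test_batch))
```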
I’d currently interpret this as “the discriminator network acts nonsensically outside the training set + generator distribution, so it gets close to chance just because that’s what nonsensical networks do.”
Thanks for sharing thoughts and links: discriminator ranking, SimCLR, CR, and BCR are all interesting, and I hadn’t run into them yet. My naive thought was that you’d have to use differentiable augmenters to fit augmentation into the generator’s training path.
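For reference, here’s my possibly-wrong picture of how a differentiable augmenter would slot in, sketched with ordinary torch ops so that gradients can flow back into the generator; the tiny models and the particular transforms are placeholders, not anything from the papers linked above.

```python
import torch
import torch.nn as nn

def diff_augment(x):
    """Random brightness shift plus small noise, built from ordinary tensor ops
    so gradients pass through to whatever produced x (e.g. the generator)."""
    brightness = (torch.rand(x.size(0), 1, 1, 1) - 0.5) * 0.4
    return x + brightness + torch.randn_like(x) * 0.02

# Placeholder generator / discriminator, only to show the gradient path.
G = nn.Sequential(nn.Linear(16, 28 * 28), nn.Tanh(), nn.Unflatten(1, (1, 28, 28)))
D = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 1), nn.Sigmoid())

z = torch.randn(8, 16)
fake = G(z)
# Augment the generator's output *before* the discriminator sees it; because
# diff_augment is differentiable, the generator loss still backpropagates
# into G's parameters (a non-differentiable augmenter would cut this path).
g_loss = -torch.log(D(diff_augment(fake)) + 1e-8).mean()
g_loss.backward()
print(next(G.parameters()).grad is not None)  # True: gradients reached G
```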
I’m averse to using Twitter, but I’ll consider whether I’m motivated enough to sign up and ask. Thanks for pointing this out.
I am definitely using this concept too vaguely, although I was gesturing at compression in the discriminator rather than the generator. Thinking of the discriminator as a lossy compressor in this way would be… positing a mapping f: discriminator weights → distributions, which for trained weights does not fully recapture the training distribution? We could then see G as attempting to match this imperfect distribution (since it doesn’t directly receive the training examples), and D as adjusting its weights to simultaneously 1. try to capture the training distribution as f(D), and 2. try to have f(D) avoid the output of G. That’s why I was thinking D might be “obfuscating”: in this picture, f(D) is pressured toward a more complicated manifold while staying close to the training distribution, which makes it more difficult for G to fit.
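(For what it’s worth, the standard GAN discriminator objective already has roughly this two-part shape, if I read 1. and 2. above as its two terms:

$$\max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

with the first term pulling whatever distribution D implicitly encodes toward the data, and the second pushing it away from wherever G currently puts its mass.)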
Is such an f implicit in the discriminator’s outputs? I think it is, just by normalizing those outputs across the whole input space, although that’s computationally infeasible. I’d be interested in work that attempts to recover the training distribution from D alone.
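Concretely, by “normalizing across the whole space” I have in mind something like an energy-based reading of the discriminator’s pre-sigmoid logit $d_\theta$ (this exact form is my assumption, not something established above):

$$f(D)(x) = \frac{\exp\!\big(d_\theta(x)\big)}{\int \exp\!\big(d_\theta(x')\big)\,dx'}$$

where the integral over the whole input space is the computationally infeasible part.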
I think it’s decently likely I’m confused here.