Something I think is highly relevant, and which might inform your GAN discussion, is the performance difference of the WGAN: basically, if you train the GAN using an optimal transport metric instead of an information-theoretic one, it seems to have much better robustness properties. This is probably because KL divergence doesn't respect the continuity of your underlying metric space (e.g., the KL divergence between δ(x0) and δ(x0 + ε) is infinite for any nonzero ε, so it doesn't capture 'closeness'; the Wasserstein distance between them, by contrast, is just |ε|). I don't yet know how I think this should tie into the high-probability latent manifold story you tell, but it seems like part of it.
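To make the δ-function example concrete, here's a minimal sketch using scipy (representing the point masses on a shared two-point support is just an illustrative choice): the KL divergence stays infinite no matter how small ε gets, while the Wasserstein-1 distance shrinks continuously to zero.

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

x0 = 0.0
for eps in [1.0, 0.1, 0.001]:
    # Represent delta(x0) and delta(x0 + eps) as point masses on a shared
    # two-point support {x0, x0 + eps}.
    p = np.array([1.0, 0.0])  # delta(x0)
    q = np.array([0.0, 1.0])  # delta(x0 + eps)

    # KL(p || q) is infinite whenever p puts mass where q has none,
    # no matter how small eps is.
    kl = entropy(p, q)

    # Wasserstein-1 distance between the two point masses is just |eps|,
    # so it shrinks continuously as the points approach each other.
    w1 = wasserstein_distance([x0], [x0 + eps])

    print(f"eps={eps}: KL={kl}, W1={w1:.4g}")
```

The KL column prints `inf` for every ε, while the W1 column tracks ε exactly, which is the continuity property the WGAN's optimal transport objective exploits.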