You should be using an MSE between the uniform distribution and the batch mean instead of a KL divergence in the loss. The batch mean is only an estimate of what you truly want, which is the mean over the entire dataset (or perhaps over all possible images, but there's not much you can do about that). If you substitute it directly for the dataset mean in the KL divergence, the resulting gradients are not unbiased estimators of the correct gradients. On the other hand, if you use an MSE loss instead, the gradients are unbiased estimators of the correct MSE gradients. In the limit as the dataset marginals approach the uniform distribution, the gradients of the KL divergence become parallel to the gradients of the MSE, so it's fine to use an MSE instead.
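To make the comparison concrete, here is a minimal sketch of the two losses side by side. It assumes PyTorch and a hypothetical `probs` tensor of per-sample marginals with shape `[batch, K]` (rows summing to 1); the names and shapes are illustrative, not taken from your code.

```python
import torch

def uniformity_losses(probs: torch.Tensor):
    """probs: [batch, K] per-sample marginals, each row summing to 1."""
    K = probs.shape[1]
    uniform = torch.full((K,), 1.0 / K, device=probs.device)

    # Batch mean: a sample estimate of the dataset-level mean marginal.
    batch_mean = probs.mean(dim=0)

    # KL(uniform || batch_mean): nonlinear in batch_mean, so plugging the
    # batch estimate in for the dataset mean gives biased gradient estimates
    # of the dataset-level KL (the point made above).
    kl_loss = torch.sum(uniform * (torch.log(uniform) - torch.log(batch_mean)))

    # MSE(batch_mean, uniform): the suggested replacement, whose gradients
    # estimate the dataset-level MSE gradients without that bias.
    mse_loss = torch.mean((batch_mean - uniform) ** 2)

    return kl_loss, mse_loss
```

Either value can be added to the training objective; the suggestion above is to back-propagate through `mse_loss` rather than `kl_loss`.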