Although not mentioned in Yang’s paper, we can instead select images proportional to
$$p \propto e^{-\beta \cdot \mathrm{error}(x,\ \mathrm{target})}$$
…
This gives the loss
$$\left\|\text{final image} - \text{target}\right\|^2 \;-\; \sum_{\text{iter}=1}^{\text{iters}} H\!\left(p^{\text{batch mean}}_{\text{iter}}\right).$$
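Here is a minimal PyTorch sketch of that selection rule and loss, assuming candidates are scored against the target by squared error; the tensor layout and the names (`errors`, `final_image`, `entropies`) are my own, not from Yang's paper.

```python
import torch
import torch.nn.functional as F

def select_and_entropy(errors: torch.Tensor, beta: float):
    """Sample one candidate per target with p ∝ exp(-beta * error), and return
    the entropy of the batch-mean distribution (the per-iteration bonus term).

    errors: [batch, num_candidates] reconstruction errors against each target.
    """
    p = F.softmax(-beta * errors, dim=-1)        # p ∝ exp(-beta * error)
    idx = torch.multinomial(p, num_samples=1)    # sampled candidate index per target
    p_mean = p.mean(dim=0)                       # batch-mean distribution
    entropy = -(p_mean * (p_mean + 1e-12).log()).sum()
    return idx.squeeze(-1), entropy

def loss_fn(final_image, target, entropies):
    """||final image - target||^2 minus the summed per-iteration entropy bonus."""
    mse = (final_image - target).pow(2).flatten(1).sum(dim=-1).mean()
    return mse - torch.stack(entropies).sum()
```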
If we want an infinite-depth model, we can halt at each step with some small probability and otherwise sample another image, continuing with probability $\delta$ (for ‘discount factor’). Also, as the depth increases, the images become more similar to each other, so $\beta$ should increase exponentially to compensate. Empirically, I found $\beta = \delta^{-\text{iter}}$ with $\delta \to 1$ to give decent results.
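A small sketch of that geometric halting and the $\beta$ schedule (plain Python; the `max_iters` safety cap is my addition):

```python
import random

def sample_depth_and_betas(delta: float = 0.99, max_iters: int = 1000):
    """At each step, continue with probability delta and halt otherwise;
    beta = delta ** (-iter) grows exponentially with depth."""
    betas = []
    it = 0
    while it < max_iters and random.random() < delta:
        it += 1
        betas.append(delta ** (-it))
    return it, betas
```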
I think you should choose β so that
$$\left(\frac{\beta}{2}\right)^{-1} = \frac{\sum_{i=1}^{B} \left\|\text{best } x_i - \text{target}_i\right\|^2}{B-1},$$
the sample variance over the batch between the closest choice and the target. This is because a good model should match both the mean and the variance of the ground truth. The ground truth is that, when you encode an image, you choose the $x_i$ that has the least reconstruction error. The probabilities $p_i \propto e^{-\beta\,\mathrm{error}(x_i,\ \text{target})}$ can be interpreted as conditional probabilities that you chose the right $x_i$ for the encoding, where each $x_i$ has a Gaussian prior for being the “right” encoding with mean $x_i$ and variance $2/\beta$. The variance of the prior for the $x_i$ that is actually chosen should match the variance it sees in the real world. Hence my recommendation for $\beta$.
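One way to compute that $\beta$ from a batch, as a rough sketch (the candidate/target layout is assumed, not specified above):

```python
import torch

def choose_beta(candidates: torch.Tensor, targets: torch.Tensor) -> float:
    """Set beta so that 2/beta equals the sample variance of the squared
    distance between each target and its closest candidate.

    candidates: [batch, num_candidates, channels, height, width]
    targets:    [batch, channels, height, width]
    """
    diffs = candidates - targets.unsqueeze(1)
    errors = diffs.flatten(2).pow(2).sum(dim=-1)   # [batch, num_candidates]
    best = errors.min(dim=-1).values               # ||best x_i - target_i||^2
    B = best.shape[0]
    sample_var = best.sum() / (B - 1)              # sum over the batch / (B - 1)
    return (2.0 / sample_var).item()               # so that 2/beta = sample_var
```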
(You should weight the MSE loss by β as well.)
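In the sketch above, that would amount to something like the following (again my naming, not the paper's):

```python
loss = beta * mse - torch.stack(entropies).sum()   # beta-weighted reconstruction term
```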