Although not mentioned in Yang’s paper, we can instead select images proportional to
$$p \propto e^{-\beta \cdot \mathrm{error}(x,\ \mathrm{target})}$$
…
This gives the loss
$$\left\|\text{final image} - \text{target}\right\|^2 \;-\; \sum_{\text{iter}=1}^{\text{iters}} H\!\left(p^{\text{batch mean}}_{\text{iter}}\right).$$
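Here is a minimal PyTorch sketch of that selection rule and loss, assuming candidates are scored against the target by squared error; the tensor layout and the names (`errors`, `final_image`, `entropies`) are my own, not from Yang's paper.

```python
import torch
import torch.nn.functional as F

def select_and_entropy(errors: torch.Tensor, beta: float):
    """Sample one candidate per target with p ∝ exp(-beta * error), and return
    the entropy of the batch-mean distribution (the per-iteration bonus term).

    errors: [batch, num_candidates] reconstruction errors against each target.
    """
    p = F.softmax(-beta * errors, dim=-1)        # p ∝ exp(-beta * error)
    idx = torch.multinomial(p, num_samples=1)    # sampled candidate index per target
    p_mean = p.mean(dim=0)                       # batch-mean distribution
    entropy = -(p_mean * (p_mean + 1e-12).log()).sum()
    return idx.squeeze(-1), entropy

def loss_fn(final_image, target, entropies):
    """||final image - target||^2 minus the summed per-iteration entropy bonus."""
    mse = (final_image - target).pow(2).flatten(1).sum(dim=-1).mean()
    return mse - torch.stack(entropies).sum()
```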
If we want an infinite-depth model, we can halt at each step with some small probability and otherwise sample another image, continuing with probability $\delta$ (for ‘discount factor’). Also, as the depth increases, the images become more similar to each other, so $\beta$ should increase exponentially to compensate. Empirically, I found $\beta = \delta^{-\text{iter}}$ with $\delta \to 1$ to give decent results.
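A small sketch of that geometric halting and the $\beta$ schedule (plain Python; the `max_iters` safety cap is my addition):

```python
import random

def sample_depth_and_betas(delta: float = 0.99, max_iters: int = 1000):
    """At each step, continue with probability delta and halt otherwise;
    beta = delta ** (-iter) grows exponentially with depth."""
    betas = []
    it = 0
    while it < max_iters and random.random() < delta:
        it += 1
        betas.append(delta ** (-it))
    return it, betas
```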
I think you should choose β so that
$$\left(\frac{\beta}{2}\right)^{-1} = \frac{\sum_{i=1}^{B} \left\|\text{best } x_i - \text{target}_i\right\|^2}{B-1},$$
the sample variance over the batch between the closest choice and the target. This is because a good model should match both the mean and the variance of the ground truth. The ground truth is that, when you encode an image, you choose the $x_i$ that has the least reconstruction error. The probabilities $p_i \propto e^{-\beta\,\mathrm{error}(x_i,\ \text{target})}$ can be interpreted as conditional probabilities that you chose the right $x_i$ for the encoding, where each $x_i$ has a Gaussian prior for being the “right” encoding with mean $x_i$ and variance $2/\beta$. The variance of the prior for the $x_i$ that is actually chosen should match the variance it sees in the real world. Hence my recommendation for $\beta$.
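One way to compute that $\beta$ from a batch, as a rough sketch (the candidate/target layout is assumed, not specified above):

```python
import torch

def choose_beta(candidates: torch.Tensor, targets: torch.Tensor) -> float:
    """Set beta so that 2/beta equals the sample variance of the squared
    distance between each target and its closest candidate.

    candidates: [batch, num_candidates, channels, height, width]
    targets:    [batch, channels, height, width]
    """
    diffs = candidates - targets.unsqueeze(1)
    errors = diffs.flatten(2).pow(2).sum(dim=-1)   # [batch, num_candidates]
    best = errors.min(dim=-1).values               # ||best x_i - target_i||^2
    B = best.shape[0]
    sample_var = best.sum() / (B - 1)              # sum over the batch / (B - 1)
    return (2.0 / sample_var).item()               # so that 2/beta = sample_var
```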
(You should weight the MSE loss by β as well.)
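In the sketch above, that would amount to something like the following (again my naming, not the paper's):

```python
loss = beta * mse - torch.stack(entropies).sum()   # beta-weighted reconstruction term
```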