I’m confused by this. The KL term we are looking at in the deterministic case is DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ]), right?
For simplicity, we imagine we have finite discrete spaces. Then this would blow up if P[X=(x1,x2),Λ=λ]≠0, and P[Λ=λ]P[X1=x1|Λ=λ]P[X2=x2|Λ=λ]=0. But this is impossible, because any of the terms in the product being 0 imply that P[X=(x1,x2),Λ=λ] is 0.
Intuitively, we construct an optimal code for encoding the distribution P[Λ]P[X1|Λ]P[X2|Λ], and the KL divergence measures how many more bits on average we need to encode a message than optimal, if the true distribution is given by P[X,Λ]. Issues occur when but the true distribution P[X,Λ] takes on values which never occur according to P[Λ]P[X1|Λ]P[X2|Λ], i.e: the optimal code doesn’t account for those values potentially occurring.
Potentially there are subtleties when we have continuous spaces. In any case I’d be grateful if you’re able to elaborate.
Yeah, I’ve since updated that deterministic functions are probably the right thing here after all, and I was indeed wrong in exactly the way you’re pointing out.
I’m confused by this. The KL term we are looking at in the deterministic case is
DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ]), right?
For simplicity, we imagine we have finite discrete spaces. Then this would blow up if P[X=(x1,x2),Λ=λ]≠0, and P[Λ=λ]P[X1=x1|Λ=λ]P[X2=x2|Λ=λ]=0. But this is impossible, because any of the terms in the product being 0 imply that P[X=(x1,x2),Λ=λ] is 0.
Intuitively, we construct an optimal code for encoding the distribution P[Λ]P[X1|Λ]P[X2|Λ], and the KL divergence measures how many more bits on average we need to encode a message than optimal, if the true distribution is given by P[X,Λ]. Issues occur when but the true distribution P[X,Λ] takes on values which never occur according to P[Λ]P[X1|Λ]P[X2|Λ], i.e: the optimal code doesn’t account for those values potentially occurring.
Potentially there are subtleties when we have continuous spaces. In any case I’d be grateful if you’re able to elaborate.
Yeah, I’ve since updated that deterministic functions are probably the right thing here after all, and I was indeed wrong in exactly the way you’re pointing out.