I’ve been working on the reverse direction: chopping up P[Λ] by clustering the points (treating each distribution as a point in distribution space) given by P[Λ|X=x], optimizing for a deterministic-in-X latent Δ=Δ(X) which minimizes DKL(P[Λ|X]||P[Λ|Δ(X)]).
This definitely separates X1 and X2 to some small error, since we can just use Δ to build a distribution over Λ which should approximately separate X1 and X2.
To show that it’s deterministic in X1 (and by symmetry X2) to some small error, I was hoping to use the fact that, given X1, X2 has very little additional information about Λ, so it’s unlikely that P[Λ|X1] is in a different cluster from P[Λ|X1,X2]. This means that P[Δ|X1] would just put most of the weight on the cluster containing P[Λ|X1].
A constructive approach for Δ would be marginally more useful in the long run, but it’s probably easier to prove things about the optimal Δ non-constructively. It’s also probably easier to prove things about Δ for a given number of clusters |Δ|, but then you also have to prove things about what the optimal value of |Δ| is.
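Concretely, one way to run this clustering for a fixed |Δ| is a KL-flavoured k-means over the posteriors. This is only a rough sketch, assuming finite discrete spaces, and all the names are placeholders; the update step uses the fact that, for a fixed assignment, the centroid minimizing the expected KL is the P[X]-weighted mixture of the members’ posteriors, which is exactly P[Λ|Δ=δ]:

```python
import numpy as np

def kl_rows(P, q):
    """D_KL(P[i] || q) for each row P[i], in bits, with 0*log(0) treated as 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log2(P / q), 0.0)
    return terms.sum(axis=1)

def cluster_posteriors(posteriors, p_x, n_clusters, n_iters=100, seed=0):
    """KL k-means sketch for Delta(x) = cluster label of the point P[Lambda|X=x].

    posteriors: array of shape (n_x, n_lambda); row x is P[Lambda|X=x]
    p_x:        array of shape (n_x,); the marginal P[X=x]
    Returns the assignment Delta(x), the centroids P[Lambda|Delta], and the
    achieved value of E_X[ D_KL(P[Lambda|X] || P[Lambda|Delta(X)]) ].
    """
    rng = np.random.default_rng(seed)
    centroids = posteriors[rng.choice(len(posteriors), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: send each x to the centroid closest in KL.
        divs = np.stack([kl_rows(posteriors, c) for c in centroids], axis=1)
        delta = divs.argmin(axis=1)
        # Update step: the best centroid for cluster d is the P[X]-weighted
        # mixture of its members' posteriors, i.e. P[Lambda | Delta = d].
        for d in range(n_clusters):
            w = p_x * (delta == d)
            if w.sum() > 0:
                centroids[d] = (w @ posteriors) / w.sum()
    divs = np.stack([kl_rows(posteriors, c) for c in centroids], axis=1)
    delta = divs.argmin(axis=1)
    return delta, centroids, float(p_x @ divs.min(axis=1))

# Toy usage: 20 random posteriors over a 3-valued Lambda, clustered into |Delta| = 4.
rng = np.random.default_rng(1)
post = rng.dirichlet(np.ones(3), size=20)
p_x = np.full(20, 1 / 20)
delta, centers, objective = cluster_posteriors(post, p_x, n_clusters=4)
print(delta, objective)
```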
Sounds like you’ve correctly understood the problem and are thinking along roughly the right lines. I expect a deterministic function of X won’t work, though.
Hand-wavily: the problem is that, if we take the latent to be a deterministic function Δ(X), then P[X|Δ(X)] has lots of zeros in it—not approximate-zeros, but true zeros. That will tend to blow up the KL-divergences in the approximation conditions.
I’d recommend looking for a function Δ(Λ). Unfortunately, that does mean you then have to prove that Δ(Λ) has low entropy given X.
I’m confused by this. The KL term we are looking at in the deterministic case is DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ]), right?
For simplicity, imagine we have finite discrete spaces. Then this would only blow up if P[X=(x1,x2),Λ=λ]≠0 while P[Λ=λ]P[X1=x1|Λ=λ]P[X2=x2|Λ=λ]=0. But this is impossible, because any of the terms in the product being 0 implies that P[X=(x1,x2),Λ=λ] is 0.
Intuitively, we construct an optimal code for the distribution P[Λ]P[X1|Λ]P[X2|Λ], and the KL divergence measures how many more bits than optimal we need on average to encode a message when the true distribution is P[X,Λ]. Issues occur when the true distribution P[X,Λ] takes on values which have probability zero under P[Λ]P[X1|Λ]P[X2|Λ], i.e., the optimal code doesn’t account for those values occurring at all.
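As a quick numeric sanity check of this argument in the finite discrete case (the small random joint below is my own toy example, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random joint P[X1, X2, Lambda] on small finite spaces, with genuine zeros,
# but keeping P[Lambda = lam] > 0 for every lam so the conditionals exist.
joint = rng.random((3, 3, 4))
joint[rng.random(joint.shape) < 0.5] = 0.0
joint[0, 0, :] = rng.random(4) + 0.1
joint /= joint.sum()

p_lam = joint.sum(axis=(0, 1))           # P[Lambda]
p_x1_lam = joint.sum(axis=1) / p_lam     # P[X1 | Lambda], shape (3, 4)
p_x2_lam = joint.sum(axis=0) / p_lam     # P[X2 | Lambda], shape (3, 4)
product = p_lam * p_x1_lam[:, None, :] * p_x2_lam[None, :, :]

# Wherever the product is zero, the joint is forced to be zero as well,
# so the KL never hits a log(positive / 0) term.
assert np.all(joint[product == 0.0] == 0.0)
with np.errstate(divide="ignore", invalid="ignore"):
    kl = np.sum(np.where(joint > 0, joint * np.log2(joint / product), 0.0))
print(f"D_KL(P[X,Lambda] || P[Lambda] P[X1|Lambda] P[X2|Lambda]) = {kl:.4f} bits")
```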
Potentially there are subtleties when we have continuous spaces. In any case I’d be grateful if you’re able to elaborate.
Yeah, I’ve since updated that deterministic functions are probably the right thing here after all, and I was indeed wrong in exactly the way you’re pointing out.
Huh, I had vaguely considered that, but I expected any P[X|Δ(X)]=0 terms to be counterbalanced by P[X,Δ(X)]=0 terms, which together contribute nothing to the KL-divergence. I’ll check my intuitions though.
I’m honestly pretty stumped at the moment. The simplest test case I’ve been using is for X1 and X2 to be two flips of a biased coin, where the bias is known to be either k or 1−k with equal probability of either. As k varies, we want to swap from Δ≅Λ to the trivial case |Δ|=1 and back. This (optimally) happens at around k=0.08 and k=0.92. If we swap there, then the sum of errors for the three diagrams of Δ does remain less than 2(ϵ+ϵ+ϵ) at all times.
Likewise, if we do try to define Δ(X), we need to swap from a Δ which is equal to the number of heads, to |Δ|=1, and back.
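For concreteness, here is a sketch of this test case. I’m taking “the three diagrams” to be the mediation diagram plus the two redundancy diagrams, with each error measured as the KL divergence between P[X,Δ] and the factorization that diagram implies, and summing the three errors per candidate; the exact conditions and all helper names here are my own assumptions, not something pinned down above:

```python
import itertools
import numpy as np

def coin_joint(k, delta_fn):
    """Joint P[X1, X2, Delta] for the test case: the bias Lambda is k or 1-k
    with probability 1/2, X1 and X2 are two flips of that coin, and the
    candidate latent is Delta = delta_fn(lam, x1, x2)."""
    joint = {}
    for lam in (k, 1.0 - k):
        for x1, x2 in itertools.product((0, 1), repeat=2):
            p = 0.5 * (lam if x1 else 1 - lam) * (lam if x2 else 1 - lam)
            d = delta_fn(lam, x1, x2)
            joint[(x1, x2, d)] = joint.get((x1, x2, d), 0.0) + p
    return joint

def kl(p, q):
    """D_KL(p || q) in bits for dicts on the same support, with 0*log(0) := 0."""
    return sum(pv * np.log2(pv / q[key]) for key, pv in p.items() if pv > 0)

def diagram_errors(joint):
    """Errors of the three diagrams for a latent Delta over X = (X1, X2):
    mediation  X1 <- Delta -> X2:  D_KL(P[X,D] || P[D] P[X1|D] P[X2|D])
    redundancy X2 -> X1 -> Delta:  D_KL(P[X,D] || P[X1,X2] P[D|X1])
    redundancy X1 -> X2 -> Delta:  D_KL(P[X,D] || P[X1,X2] P[D|X2])"""
    def marg(idx):
        m = {}
        for key, p in joint.items():
            sub = tuple(key[i] for i in idx)
            m[sub] = m.get(sub, 0.0) + p
        return m

    p_d, p_x = marg([2]), marg([0, 1])
    p_x1d, p_x2d = marg([0, 2]), marg([1, 2])
    p_x1, p_x2 = marg([0]), marg([1])
    med, red1, red2 = {}, {}, {}
    for (x1, x2, d) in joint:
        med[(x1, x2, d)] = p_x1d[(x1, d)] * p_x2d[(x2, d)] / p_d[(d,)]
        red1[(x1, x2, d)] = p_x[(x1, x2)] * p_x1d[(x1, d)] / p_x1[(x1,)]
        red2[(x1, x2, d)] = p_x[(x1, x2)] * p_x2d[(x2, d)] / p_x2[(x2,)]
    return kl(joint, med), kl(joint, red1), kl(joint, red2)

# Compare candidate latents as k varies, to see where the best one switches.
candidates = {
    "Delta = Lambda":      lambda lam, x1, x2: lam,
    "Delta = #heads":      lambda lam, x1, x2: x1 + x2,
    "trivial |Delta| = 1": lambda lam, x1, x2: 0,
}
for k in np.arange(0.02, 0.52, 0.02):
    sums = {name: sum(diagram_errors(coin_joint(k, fn))) for name, fn in candidates.items()}
    best = min(sums, key=sums.get)
    print(f"k = {k:.2f}  " + "  ".join(f"{n}: {s:.4f}" for n, s in sums.items()) + f"  -> {best}")
```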
In neither case can I find a construction of Δ(X) or Δ(Λ) which swaps from one phase to the other at the right time! My final thought is for Δ to be some mapping Λ→P(Λ), sending each value of Λ to a ball in probability space of variable radius (no idea how to calculate the radius), which would take k→{k} at k≈1 and k→{k,1−k} at k≈0.5. Or maybe you have to map Λ→P(X) or something like that. But for now I don’t even have a construction I can try to prove things for.
Perhaps a constructive approach isn’t feasible, which probably means I don’t have quite the right skillset to do this.