I’ve thought about it a bit and have a line of attack for a proof, but there’s too much work involved in following it through to an actual proof, so I’m going to leave it here in case it helps anyone.
I’m assuming everything is discrete so I can work with regular Shannon entropy.
Consider the range R1 of the function g1:λ↦P(X1|Λ=λ) and R2 defined similarly. Discretize R1 and R2 (chop them up into little balls). Not sure which metric to use, maybe TV.
Define Λ′1(λ) to be the index of the ball into which P(X1|Λ=λ) falls, and Λ′2 similarly. So if d(P(X1|Λ=a),P(X1|Λ=b)) is sufficiently small, then Λ′1(a)=Λ′1(b).
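To make the discretization concrete, here’s a minimal sketch of one way to do it (my own construction, using coordinate-wise cells of side δ rather than literal TV balls; as with any fixed partition, two distributions straddling a cell boundary can still land in different cells):

```python
import numpy as np

def ball_index(p, delta):
    """Index of the cell that the distribution p = P(X1 | Lambda = lam) falls
    into, when the probability simplex is chopped into axis-aligned cells of
    side delta.  Distributions whose coordinates all agree to well within
    delta usually (not always: cell boundaries) get the same index."""
    return tuple(np.floor(np.asarray(p) / delta).astype(int))

def discretize(cond_dists, delta):
    """Lambda'_1: map each lambda (one row of cond_dists, giving
    P(X1 | Lambda = lam)) to a small integer labelling its cell."""
    cells, labels = {}, []
    for p in cond_dists:
        idx = ball_index(p, delta)
        labels.append(cells.setdefault(idx, len(cells)))
    return np.array(labels)
```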
By the data processing inequality, conditions 2 and 3 still hold for Λ′=(Λ′1,Λ′2). Condition 1 should hold with some extra slack depending on the coarseness of the discretization.
It takes a few steps, but I think you might be able to argue that, with high probability, for each X2=x2, the random variable Q1:=P(X1|Λ′1) will be highly concentrated (n.b. I’ve only worked it through fully in the exact case, and I think it can be translated to the approximate case but I haven’t checked). We then invoke the discretization to argue that H(Λ′1|X1) is bounded. The intuition is that the discretization forces nearby probabilities to coincide, so if Q1 is concentrated then it actually has to “collapse” most of its mass onto a few discrete values.
We can then make a similar argument switching the indices to get H(Λ′2|X2) bounded. Finally, maybe applying conditions 2 and 3 we can get H(Λ′1|X2) bounded as well, which then gives a bound on H(Λ|Xi).
I did try feeding this to Gemini but it wasn’t able to produce a proof.
I’ve been working on the reverse direction: chopping up P[Λ] by clustering the points (treating each distribution as a point in distribution space) given by P[Λ|X=x], optimizing for a deterministic-in-X latent Δ=Δ(X) which minimizes DKL(P[Λ|X]||P[Λ|Δ(X)]).
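Here’s a minimal sketch of the sort of optimization I mean, under a couple of assumptions: the number of clusters |Δ| is fixed in advance, and the objective is the P[X=x]-weighted expected KL. Since the minimizer of the expected DKL(P[Λ|X]||q) over a cluster is just the weight-averaged posterior, i.e. P[Λ|Δ], this amounts to a Bregman-style k-means (all names below are mine, not from any particular library):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats, with 0 log 0 = 0; inf if q = 0 somewhere p > 0."""
    m = p > 0
    if np.any(q[m] == 0):
        return np.inf
    return np.sum(p[m] * np.log(p[m] / q[m]))

def kl_kmeans(posteriors, weights, n_clusters, n_iters=100, seed=0):
    """Hard clustering of the posteriors P[Lambda | X = x] (rows of
    `posteriors`, shape (n_x, n_lambda), weighted by P[X = x]) that locally
    minimizes  E_X[ D_KL( P[Lambda|X] || P[Lambda|Delta(X)] ) ]."""
    posteriors = np.asarray(posteriors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(posteriors)
    delta = rng.integers(n_clusters, size=n)            # initial Delta(x)
    for _ in range(n_iters):
        # Centroid of each cluster is exactly P[Lambda | Delta = d]:
        # the weight-averaged posterior of the cluster's members.
        centroids = []
        for d in range(n_clusters):
            w = weights * (delta == d)
            if w.sum() == 0:                             # empty cluster: reseed
                centroids.append(posteriors[rng.integers(n)])
            else:
                centroids.append(w @ posteriors / w.sum())
        centroids = np.array(centroids)
        # Reassign each x to the centroid with the smallest KL from its posterior.
        new_delta = np.array([np.argmin([kl(p, c) for c in centroids])
                              for p in posteriors])
        if np.all(new_delta == delta):
            break
        delta = new_delta
    per_x = np.array([kl(p, centroids[d]) for p, d in zip(posteriors, delta)])
    return delta, centroids, float(weights @ per_x)
```

The number of clusters still has to be chosen separately, which is the |Δ| issue mentioned further down.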
This definitely separates X1 and X2 to some small error, since we can just use Δ to build a distribution over Λ which should approximately separate X1 and X2.
To show that it’s deterministic in X1 (and by symmetry X2) to some small error, I was hoping to use the fact that, given X1, X2 has very little information about Λ, so it’s unlikely that P[Λ|X1] is in a different cluster to P[Λ|X1,X2]. This means that P[Δ|X1] would just put most of the weight on the cluster containing P[Λ|X1].
A constructive approach for Δ would be marginally more useful in the long run, but it’s also probably easier to prove things about the optimal Δ. It’s also probably easier to prove things about Δ for a given number of clusters |Δ|, but then you also have to prove things about what the optimal value of |Δ| is.
Sounds like you’ve correctly understood the problem and are thinking along roughly the right lines. I expect a deterministic function of X won’t work, though.
Hand-wavily: the problem is that, if we take the latent to be a deterministic function Δ(X), then P[X|Δ(X)] has lots of zeros in it—not approximate-zeros, but true zeros. That will tend to blow up the KL-divergences in the approximation conditions.
I’d recommend looking for a function Δ(Λ). Unfortunately that does mean that low entropy of Δ(Λ) given X has to be proven.
I’m confused by this. The KL term we are looking at in the deterministic case is DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ]), right?
For simplicity, we imagine we have finite discrete spaces. Then this would blow up if P[X=(x1,x2),Λ=λ]≠0 while P[Λ=λ]P[X1=x1|Λ=λ]P[X2=x2|Λ=λ]=0. But this is impossible, because any of the terms in the product being 0 implies that P[X=(x1,x2),Λ=λ] is 0.
Intuitively, we construct an optimal code for the distribution P[Λ]P[X1|Λ]P[X2|Λ], and the KL divergence measures how many more bits on average we need to encode a message than optimal, when the true distribution is P[X,Λ]. Issues occur when the true distribution P[X,Λ] puts weight on values which never occur according to P[Λ]P[X1|Λ]P[X2|Λ], i.e. the optimal code doesn’t account for those values potentially occurring.
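For concreteness, a quick numerical illustration of the finiteness claim (the joint below is arbitrary made-up data): wherever the factorized distribution P[Λ]P[X1|Λ]P[X2|Λ] is zero, the true joint is zero as well, so with the 0 log 0 = 0 convention no term of the KL blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary discrete joint P[x1, x2, lam] (here 3 x 4 x 2), with some
# entries forced to zero so the factorized distribution also has zeros.
P = rng.random((3, 4, 2))
P[rng.random(P.shape) < 0.3] = 0.0
P /= P.sum()

P_lam = P.sum(axis=(0, 1))                                   # P[Lambda]
with np.errstate(divide="ignore", invalid="ignore"):
    P_x1_g_lam = np.nan_to_num(P.sum(axis=1) / P_lam)        # P[X1 | Lambda]
    P_x2_g_lam = np.nan_to_num(P.sum(axis=0) / P_lam)        # P[X2 | Lambda]

# Factorized distribution P[Lambda] P[X1|Lambda] P[X2|Lambda]
Q = P_lam[None, None, :] * P_x1_g_lam[:, None, :] * P_x2_g_lam[None, :, :]

# Any factor being zero forces the joint P to be zero at that point, so the
# support of P is contained in the support of Q and the KL is finite.
mask = P > 0
assert np.all(Q[mask] > 0)
print(np.sum(P[mask] * np.log(P[mask] / Q[mask])))   # finite, = I(X1;X2|Lambda)
```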
Potentially there are subtleties when we have continuous spaces. In any case I’d be grateful if you’re able to elaborate.
Yeah, I’ve since updated that deterministic functions are probably the right thing here after all, and I was indeed wrong in exactly the way you’re pointing out.
Huh, I had vaguely considered that but I expected any P[X|Δ(X)]=0 terms to be counterbalanced by P[X,Δ(X)]=0 terms, which together contribute nothing to the KL-divergence. I’ll check my intuitions though.
I’m honestly pretty stumped at the moment. The simplest test case I’ve been using is for X1 and X2 to be two flips of a biased coin, where the bias is known to be either k or 1−k with equal probability of either. As k varies, we want to swap from Δ≅Λ to the trivial case |Δ|=1 and back. This (optimally) happens at around k=0.08 and k=0.92. If we swap there, then the sum of errors for the three diagrams of Δ does remain less than 2(ϵ+ϵ+ϵ) at all times.
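A rough numerical check of that crossover, measuring each diagram’s error as a conditional mutual information (e.g. DKL(P[X,Δ]||P[Δ]P[X1|Δ]P[X2|Δ]) = I(X1;X2|Δ)); treat this as a sketch rather than the exact computation:

```python
import numpy as np

def joint(k):
    """Joint P[lam, x1, x2] for the biased-coin example: lam=0 means bias k,
    lam=1 means bias 1-k, each with probability 1/2; x1, x2 are two
    independent flips given the bias (1 = heads)."""
    P = np.zeros((2, 2, 2))
    for lam, bias in enumerate([k, 1 - k]):
        for x1 in (0, 1):
            for x2 in (0, 1):
                p1 = bias if x1 else 1 - bias
                p2 = bias if x2 else 1 - bias
                P[lam, x1, x2] = 0.5 * p1 * p2
    return P

def H(P):
    """Shannon entropy in bits of a joint array, with 0 log 0 = 0."""
    p = P[P > 0]
    return -np.sum(p * np.log2(p))

def errors(P):
    """The three diagram errors when the latent is Lambda itself, each as a
    conditional mutual information:
      e_med  = I(X1; X2 | Lambda)   (mediation)
      e_red1 = I(Lambda; X1 | X2)   (redundancy)
      e_red2 = I(Lambda; X2 | X1)   (redundancy)
    using I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    P_l, P_lx1, P_lx2 = P.sum(axis=(1, 2)), P.sum(axis=2), P.sum(axis=1)
    P_x1x2, P_x1, P_x2 = P.sum(axis=0), P.sum(axis=(0, 2)), P.sum(axis=(0, 1))
    e_med = H(P_lx1) + H(P_lx2) - H(P) - H(P_l)
    e_red1 = H(P_lx2) + H(P_x1x2) - H(P) - H(P_x2)
    e_red2 = H(P_lx1) + H(P_x1x2) - H(P) - H(P_x1)
    return e_med, e_red1, e_red2

# For Delta = Lambda the mediation error is ~0; for the trivial |Delta| = 1
# the redundancy errors are 0 and the mediation error is I(X1; X2).
# Sweep k to see where the two total errors cross.
for k in np.linspace(0.01, 0.2, 20):
    P = joint(k)
    e_med, e_red1, e_red2 = errors(P)
    i_x1x2 = H(P.sum(axis=(0, 2))) + H(P.sum(axis=(0, 1))) - H(P.sum(axis=0))
    print(f"k={k:.2f}  total(Delta=Lambda)={e_med + e_red1 + e_red2:.4f}  "
          f"total(|Delta|=1)={i_x1x2:.4f}")
```

With this metric, for Δ≅Λ the mediation error vanishes and the total is I(Λ;X1|X2)+I(Λ;X2|X1), while for |Δ|=1 the redundancy errors vanish and the total is I(X1;X2); the two totals cross near k≈0.08, matching the figure above.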
Likewise, if we do try to define Δ(X), we need to swap from a Δ which is equal to the number of heads, to |Δ|=1, and back.
In neither case can I find a construction of Δ(X) or Δ(Λ) which swaps from one phase to the other at the right time! My final thought is for Δ to be some mapping Λ→P(Λ) consisting of a ball in probability space of variable radius (no idea how to calculate the radius) which would take k→{k} at k≈1 and k→{k,1−k} at k≈0.5. Or maybe you have to map Λ→P(X) or something like that. But for now I don’t even have a construction I can try to prove things for.
Perhaps a constructive approach isn’t feasible, which probably means I don’t have quite the right skillset to do this.
OK so some further thoughts on this: suppose we instead just partition the values of Λ directly by something like a clustering algorithm, based on DKL in P[X|Λ] space, and take Δ(λ) to just be the cluster that λ is in (a rough code sketch is at the end of this comment):
Assuming we can do it with small clusters, we know that P[X|Λ]≈P[X|Δ] up to some pretty small error, so DKL(P[X]||P[X|Δ]) is also small.
And if we consider X2←X1→Λ, this tells us that learning X1 restricts us to a pretty small region of P[X2] space (since P[X2|X1]≈P[X2|X1,Λ]) so Δ should be approximately deterministic in X1. This second part is more difficult to formalize, though.
Edit: The real issue is whether or not we could have lots of Λ values which produce the same distribution over X2 but different distributions over X1, and which are all pretty likely given X1=x1 for some x1. I think this just can’t really happen for probable values of x1, for two reasons: first, if these values of λ produce the same distribution over X2 but different distributions over X1, then that doesn’t satisfy X1←X2→Λ; and second, if they produced wildly different distributions over X1, then they can’t all have high values of P[X1=x1|Λ=λ], and so they’re not going to have high values of P[Λ=λ|X1=x1].
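Here’s a rough sketch of the clustering step described above (greedy and threshold-based; the symmetrized KL and the threshold ε are just one concrete choice, not something I’m claiming is optimal):

```python
import numpy as np

def sym_kl(p, q):
    """Symmetrized KL between two rows of P[X | Lambda] (0 log 0 = 0,
    infinite if the supports differ)."""
    def kl(a, b):
        m = a > 0
        return np.inf if np.any(b[m] == 0) else np.sum(a[m] * np.log(a[m] / b[m]))
    return kl(p, q) + kl(q, p)

def cluster_lambdas(cond_x_given_lam, eps):
    """Greedy partition of the lambda values: each lambda joins the nearest
    existing cluster if that cluster's representative row P[X | Lambda = rep]
    is within eps in symmetrized KL, otherwise it starts a new cluster.
    Delta(lambda) is the returned cluster index."""
    reps, delta = [], []
    for i, p in enumerate(cond_x_given_lam):      # one row per lambda value
        dists = [sym_kl(p, cond_x_given_lam[r]) for r in reps]
        if dists and min(dists) <= eps:
            delta.append(int(np.argmin(dists)))
        else:
            reps.append(i)                        # this lambda founds a new cluster
            delta.append(len(reps) - 1)
    return np.array(delta)
```

P[X|Δ=d] is then the P[Λ|Δ=d]-weighted mixture of the rows P[X|Λ=λ] in cluster d, which is what the P[X|Λ]≈P[X|Δ] approximation above relies on.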