Does The Information-Throughput-Maximizing Input Distribution To A Sparsely-Connected Channel Satisfy An Undirected Graphical Model?
[EDIT: Never mind, proved it.]
Suppose I have an information channel X→Y. The X components X1,...,Xm and the Y components Y1,...,Yn are sparsely connected, i.e. the typical Yi is downstream of only a few parent X-components X_pa(i). (Mathematically, that means the channel factors as P[Y|X] = ∏_i P[Yi|X_pa(i)].)
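For concreteness, here is what that factorization looks like numerically in a toy instance: three binary X-components, two binary Y-components, with pa(1) = {1,3} and pa(2) = {2,3}. (A minimal numpy sketch; the shapes, parent sets, and variable names are made up purely for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse channel: X = (X1, X2, X3), Y = (Y1, Y2), all binary,
# with Y1 depending only on (X1, X3) and Y2 depending only on (X2, X3).
P_Y1 = rng.random((2, 2, 2))                    # table for P[Y1 | X1, X3], indexed [x1, x3, y1]
P_Y1 /= P_Y1.sum(axis=-1, keepdims=True)
P_Y2 = rng.random((2, 2, 2))                    # table for P[Y2 | X2, X3], indexed [x2, x3, y2]
P_Y2 /= P_Y2.sum(axis=-1, keepdims=True)

# Full channel P[Y|X] = P[Y1|X1,X3] * P[Y2|X2,X3], assembled as a matrix whose rows
# range over joint inputs (x1, x2, x3) and whose columns range over joint outputs (y1, y2).
W = np.einsum('ace,bcf->abcef', P_Y1, P_Y2).reshape(8, 4)
assert np.allclose(W.sum(axis=1), 1.0)          # each row is a conditional distribution over Y
```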
Now, suppose I split the Y components into two sets, and hold constant any X-components which are upstream of components in both sets. Conditional on those (relatively few) X-components, our channel splits into two independent channels.
E.g. in the image above, if I hold X4 constant, then I have two independent channels: (X1,X2,X3)→(Y1,Y2,Y3,Y4) and (X5,X6,X7)→(Y5,Y6,Y7,Y8).
Now, the information-throughput-maximizing input distribution for a pair of independent channels is just the product of the throughput-maximizing distributions for the two channels individually. In other words: for independent channels, we get an independent throughput-maximizing distribution.
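(As a quick numerical sanity check of that standard fact, one can run Blahut-Arimoto on two small random channels and on their product channel and compare. A rough sketch, with arbitrary alphabet sizes and a fixed iteration count rather than a proper convergence test:)

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_info(p_x, W):
    """I(X;Y) in nats, for input distribution p_x and channel matrix W[x, y]."""
    q_y = p_x @ W
    ratio = np.where(W > 0, W / q_y, 1.0)        # zero-probability entries contribute nothing
    return float(np.sum(p_x[:, None] * W * np.log(ratio)))

def blahut_arimoto(W, iters=2000):
    """Standard Blahut-Arimoto iteration toward a capacity-achieving input distribution."""
    p = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(iters):
        q = p @ W
        d = np.sum(W * np.log(W / q), axis=1)    # KL(W[x,:] || q) for each input x
        p *= np.exp(d)
        p /= p.sum()
    return p, mutual_info(p, W)

def random_channel(nx, ny):
    W = rng.random((nx, ny))
    return W / W.sum(axis=1, keepdims=True)

W1 = random_channel(3, 4)                                 # channel X1 -> Y1
W2 = random_channel(2, 3)                                 # channel X2 -> Y2
W12 = np.einsum('ac,bd->abcd', W1, W2).reshape(6, 12)     # joint channel (X1,X2) -> (Y1,Y2)

p1, C1 = blahut_arimoto(W1)
p2, C2 = blahut_arimoto(W2)
_, C12 = blahut_arimoto(W12)
print(C1 + C2, C12)                                       # should agree up to convergence error
print(mutual_info(np.outer(p1, p2).reshape(-1), W12))     # product of the individual optima gives C1 + C2
```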
So it seems like a natural guess that something similar would happen in our sparse setup.
Conjecture: The throughput-maximizing distribution for our sparse setup is independent conditional on overlapping X-components. E.g. in the example above, we’d guess that P[X]=P[X4]P[X1,X2,X3|X4]P[X5,X6,X7|X4] for the throughput maximizing distribution.
If that’s true in general, then we can apply it to any Markov blanket in our sparse channel setup, so it implies that P[X] factors over any set of X components which is a Markov blanket splitting the original channel graph. In other words: it would imply that the throughput-maximizing distribution satisfies an undirected graphical model, in which two X-components share an edge if-and-only-if they share a child Y-component.
It’s not obvious that this works mathematically; information throughput maximization (i.e. the optimization problem by which one computes channel capacity) involves some annoying coupling between terms. But it makes sense intuitively. I’ve spent less than an hour trying to prove it and mostly found it mildly annoying though not clearly intractable. Seems like the sort of thing where either (a) someone has already proved it, or (b) someone more intimately familiar with channel capacity problems than I am could easily prove it.
So: anybody know of an existing proof (or know that the conjecture is false), or find this conjecture easy to prove themselves?
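For what it's worth, here is one way to poke at the conjecture numerically on a tiny example: generate a random sparse channel with X = (X1, X2, X3) and Y = (Y1, Y2) (X3 being the overlapping component), run Blahut-Arimoto over the joint input alphabet, then build the factored candidate Q[X] = P[X3]P[X1|X3]P[X2|X3] from the resulting marginals and compare throughput. (A rough sketch, assuming binary variables and a fixed iteration count; suggestive at best, not a substitute for a proof.)

```python
import numpy as np

rng = np.random.default_rng(1)

def mutual_info(p_x, W):
    """I(X;Y) in nats, for input distribution p_x (flat vector) and channel matrix W[x, y]."""
    q_y = p_x @ W
    ratio = np.where(W > 0, W / q_y, 1.0)
    return float(np.sum(p_x[:, None] * W * np.log(ratio)))

def blahut_arimoto(W, iters=5000):
    """Standard Blahut-Arimoto iteration toward a capacity-achieving input distribution."""
    p = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(iters):
        q = p @ W
        d = np.sum(W * np.log(W / q), axis=1)    # KL(W[x,:] || q) for each input x
        p *= np.exp(d)
        p /= p.sum()
    return p

# Toy sparse channel: X = (X1, X2, X3), Y = (Y1, Y2), all binary,
# with P[Y|X] = P[Y1|X1,X3] * P[Y2|X2,X3]  (X3 is the overlapping component).
A = rng.random((2, 2, 2)); A /= A.sum(axis=-1, keepdims=True)   # P[Y1 | x1, x3]
B = rng.random((2, 2, 2)); B /= B.sum(axis=-1, keepdims=True)   # P[Y2 | x2, x3]
W = np.einsum('ace,bcf->abcef', A, B).reshape(8, 4)             # rows: (x1,x2,x3), cols: (y1,y2)

P = blahut_arimoto(W).reshape(2, 2, 2)      # (near-)optimal joint input distribution P[x1, x2, x3]

# Factored candidate Q[X] = P[X3] P[X1|X3] P[X2|X3], built from P's marginals;
# by construction Q makes X1 and X2 independent given X3.
P3  = P.sum(axis=(0, 1))                    # P[x3]
P13 = P.sum(axis=1)                         # P[x1, x3]
P23 = P.sum(axis=0)                         # P[x2, x3]
Q = P13[:, None, :] * P23[None, :, :] / P3

print("throughput of Blahut-Arimoto solution:", mutual_info(P.reshape(-1), W))
print("throughput of factored candidate Q:   ", mutual_info(Q.reshape(-1), W))
```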
Proof
Specifically, we’ll show that there exists an information throughput maximizing distribution which satisfies the undirected graph. We will not show that all optimal distributions satisfy the undirected graph, because that’s false in some trivial cases—e.g. if all the Y’s are completely independent of X, then all distributions are optimal. We will also not show that all optimal distributions factor over the undirected graph, which is importantly different because of the P[X]>0 caveat in the Hammersley-Clifford theorem.
First, we’ll prove the (already known) fact that an independent distribution P[X]=P[X1]P[X2] is optimal for a pair of independent channels (X1→Y1,X2→Y2); we’ll prove it in a way which will play well with the proof of our more general theorem. Using standard information identities plus the factorization structure Y1−X1−X2−Y2 (that’s a Markov chain, not subtraction), we get
MI(X;Y) = MI(X;Y1) + MI(X;Y2|Y1)
= MI(X;Y1) + (MI(X;Y2) − MI(Y2;Y1) + MI(Y2;Y1|X))
= MI(X1;Y1) + MI(X2;Y2) − MI(Y2;Y1)
Now, suppose you hand me some supposedly-optimal distribution P[X]. From P, I construct a new distribution Q[X] := P[X1]P[X2]. Note that MI(X1;Y1) and MI(X2;Y2) are both the same under Q as under P, while MI(Y2;Y1) is zero under Q. So, because MI(X;Y) = MI(X1;Y1) + MI(X2;Y2) − MI(Y2;Y1), the MI(X;Y) must be at least as large under Q as under P. In short: given any distribution, I can construct another distribution with at least as high information throughput, under which X1 and X2 are independent.
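A quick numerical check of both the identity and this construction, on a pair of random independent channels fed a random correlated input (a sketch; the alphabet sizes and names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

def mi_from_joint(pxy):
    """Mutual information (in nats) between the row variable and the column variable of a joint table."""
    pa = pxy.sum(axis=1, keepdims=True)
    pb = pxy.sum(axis=0, keepdims=True)
    ratio = np.where(pxy > 0, pxy / (pa * pb), 1.0)
    return float(np.sum(pxy * np.log(ratio)))

# Independent channels W1 = P[Y1|X1], W2 = P[Y2|X2].
W1 = rng.random((3, 3)); W1 /= W1.sum(axis=1, keepdims=True)
W2 = rng.random((2, 4)); W2 /= W2.sum(axis=1, keepdims=True)

# An arbitrary (correlated) input distribution P[X1, X2].
P = rng.random((3, 2)); P /= P.sum()

def check(P):
    # Joint over (X1, X2, Y1, Y2): P[x1,x2] * W1[x1,y1] * W2[x2,y2].
    J = np.einsum('ab,ac,bd->abcd', P, W1, W2)
    I_X_Y   = mi_from_joint(J.reshape(6, 12))     # I(X;Y) with X=(X1,X2), Y=(Y1,Y2)
    I_X1_Y1 = mi_from_joint(J.sum(axis=(1, 3)))   # I(X1;Y1)
    I_X2_Y2 = mi_from_joint(J.sum(axis=(0, 2)))   # I(X2;Y2)
    I_Y1_Y2 = mi_from_joint(J.sum(axis=(0, 1)))   # I(Y1;Y2)
    return I_X_Y, I_X1_Y1 + I_X2_Y2 - I_Y1_Y2

print(check(P))                                   # the two numbers should match (the identity above)
Q = np.outer(P.sum(axis=1), P.sum(axis=0))        # Q[X] := P[X1] P[X2]
print(check(Q))                                   # throughput at least as high as under P; here I(Y1;Y2) = 0
```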
Now let’s tackle our more general theorem, reusing some of the machinery above.
I’ll split Y into Y1 and Y2, and split X into X1−2 (parents of Y1 but not Y2), X2−1 (parents of Y2 but not Y1), and X1∩2 (parents of both). Then
MI(X;Y)=MI(X1∩2;Y)+MI(X1−2,X2−1;Y|X1∩2)
In analogy to the case above, we consider a distribution P[X], and construct a new distribution Q[X] := P[X1∩2]P[X1−2|X1∩2]P[X2−1|X1∩2]. Compared to P, Q has the same value of MI(X1∩2;Y), and by exactly the same argument as the independent case, MI(X1−2,X2−1;Y|X1∩2) cannot be any lower under Q than under P; we just repeat the same argument with everything conditional on X1∩2 throughout. So, given any distribution, I can construct another distribution with at least as high information throughput, under which X1−2 and X2−1 are independent given X1∩2.
Since this works for any Markov blanket X1∩2, there exists an information-throughput-maximizing distribution which satisfies the desired undirected graph.
I suppose another way to look at this is that the overlapping components are the blanket states in some kind of time-dependent Markov blanket setup, right?
In the scenario you created, you could treat x1,x2,x3 as the shielded state at time step t, i.e. i_t. Then x5,x6,x7 are the states outside of the blanket, i.e. e_t (which group of states is i and which is e don’t really matter, so long as they are on either side of the blanket). y1,y2,y3,y4[1] become i_{t+1}, and y5,y6,y7,y8 become e_{t+1}.
Then x4 becomes the blanket b_t such that
I(i_{t+1}; e_{t+1} | b_t) ≈ 0
and
P(i_{t+1}, e_{t+1} | i_t, e_t, b_t) = P(i_{t+1} | i_t, b_t) ⋅ P(e_{t+1} | e_t, b_t)
With all that implies. In fact you can just as easily have three shielded states, or four, using this formulation.
(the setup for this is shamelessly ripped off from @Gunnar_Zarncke’s unsupervised agent detection work)
[1] Did you miss an arrow going to y4?
(Was in the middle of writing a proof before noticing you did it already)
I believe the end result is that if we have Y=(Y1,Y2), X=(X1,X2,X3) with P(Y|X)=P(Y1|X1,X3)P(Y2|X2,X3) (X1 upstream of Y1, X2 upstream of Y2, X3 upstream of both),
then maximizing I(X;Y) is equivalent to maximizing I(Y1;X1,X3)+I(Y2;X2,X3)−I(Y1;Y2).
& for the proof we can basically replicate the proof for additivity, except substituting the factorization P(X1,X2,X3)=P(X3)P(X1|X3)P(X2|X3) as the assumption in place of independence; then both directions of the inequality will result in I(Y1;X1,X3)+I(Y2;X2,X3)−I(Y1;Y2).
[EDIT: Forgot −I(Y1;Y2) term due to marginal dependence P(Y1,Y2)≠P(Y1)P(Y2)]
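(In case it's useful to anyone checking this: the decomposition I(X;Y) = I(Y1;X1,X3) + I(Y2;X2,X3) − I(Y1;Y2) holds for this channel structure under an arbitrary input distribution, factored or not, so it is easy to verify numerically. A small sketch with made-up binary alphabets:)

```python
import numpy as np

rng = np.random.default_rng(3)

def mi(joint):
    """Mutual information (in nats) between the row variable and the column variable of a 2-D joint table."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    ratio = np.where(joint > 0, joint / (pa * pb), 1.0)
    return float(np.sum(joint * np.log(ratio)))

# Channel P(Y|X) = P(Y1|X1,X3) P(Y2|X2,X3), everything binary; arbitrary (non-factored) P(X).
A = rng.random((2, 2, 2)); A /= A.sum(axis=-1, keepdims=True)   # P(Y1 | x1, x3)
B = rng.random((2, 2, 2)); B /= B.sum(axis=-1, keepdims=True)   # P(Y2 | x2, x3)
P = rng.random((2, 2, 2)); P /= P.sum()                         # P(x1, x2, x3)

# Full joint over (X1, X2, X3, Y1, Y2).
J = np.einsum('abc,ace,bcf->abcef', P, A, B)

lhs       = mi(J.reshape(8, 4))                   # I(X; Y)
I_Y1_X1X3 = mi(J.sum(axis=(1, 4)).reshape(4, 2))  # I(Y1; X1, X3)
I_Y2_X2X3 = mi(J.sum(axis=(0, 3)).reshape(4, 2))  # I(Y2; X2, X3)
I_Y1_Y2   = mi(J.sum(axis=(0, 1, 2)))             # I(Y1; Y2)
print(lhs, I_Y1_X1X3 + I_Y2_X2X3 - I_Y1_Y2)       # these should agree for any P(X)
```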