Proof
Specifically, we’ll show that there exists an information throughput maximizing distribution which satisfies the undirected graph. We will not show that all optimal distributions satisfy the undirected graph, because that’s false in some trivial cases—e.g. if all the Y’s are completely independent of X, then all distributions are optimal. We will also not show that all optimal distributions factor over the undirected graph, which is importantly different because of the P[X]>0 caveat in the Hammersley-Clifford theorem.
First, we’ll prove the (already known) fact that an independent distribution P[X]=P[X1]P[X2] is optimal for a pair of independent channels (X1→Y1,X2→Y2); we’ll prove it in a way which will play well with the proof of our more general theorem. Using standard information identities plus the factorization structure Y1−X1−X2−Y2 (that’s a Markov chain, not subtraction), we get
MI(X;Y)=MI(X;Y1)+MI(X;Y2|Y1)
=MI(X;Y1)+(MI(X;Y2)−MI(Y2;Y1)+MI(Y2;Y1|X))
=MI(X1;Y1)+MI(X2;Y2)−MI(Y2;Y1)
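As a quick numerical sanity check (not part of the proof), here's a small Python sketch which verifies this identity on a randomly chosen correlated input distribution and a random pair of independent channels. The helper name mi, the alphabet sizes, and the random seed are all just illustrative choices, not anything from the setup above.

```python
# Illustrative check (in nats): for independent channels X1 -> Y1, X2 -> Y2,
#   MI(X;Y) = MI(X1;Y1) + MI(X2;Y2) - MI(Y1;Y2)
# for an arbitrary (possibly correlated) input distribution P[X1,X2].
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

def mi(joint, a_axes, b_axes):
    """Mutual information between variable groups a_axes and b_axes of a joint
    probability array; all other axes are marginalized out."""
    keep = tuple(sorted(a_axes + b_axes))
    other = tuple(ax for ax in range(joint.ndim) if ax not in keep)
    p = joint.sum(axis=other) if other else joint
    new_a = tuple(keep.index(ax) for ax in a_axes)   # axis positions after marginalizing
    new_b = tuple(keep.index(ax) for ax in b_axes)
    pa = p.sum(axis=new_b, keepdims=True)
    pb = p.sum(axis=new_a, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (pa * pb)[mask])).sum())

# Random correlated input P[x1,x2] and random channels P[y1|x1], P[y2|x2].
nx1, nx2, ny1, ny2 = 3, 4, 3, 4                      # made-up alphabet sizes
p_x = rng.random((nx1, nx2)); p_x /= p_x.sum()
ch1 = rng.random((nx1, ny1)); ch1 /= ch1.sum(axis=1, keepdims=True)
ch2 = rng.random((nx2, ny2)); ch2 /= ch2.sum(axis=1, keepdims=True)

# Joint P[x1,x2,y1,y2] = P[x1,x2] P[y1|x1] P[y2|x2] (axes: x1, x2, y1, y2).
joint = p_x[:, :, None, None] * ch1[:, None, :, None] * ch2[None, :, None, :]

lhs = mi(joint, (0, 1), (2, 3))                                        # MI(X;Y)
rhs = mi(joint, (0,), (2,)) + mi(joint, (1,), (3,)) - mi(joint, (2,), (3,))
print(lhs, rhs)   # the two values agree up to floating point error
```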
Now, suppose you hand me some supposedly-optimal distribution P[X]. From P, I construct a new distribution Q[X]:=P[X1]P[X2]. Note that MI(X1;Y1) and MI(X2;Y2) are both the same under Q as under P (each depends only on the corresponding marginal P[Xi] and its channel, both of which Q preserves), while MI(Y2;Y1) is zero under Q (under Q, X1 and X2 are independent, and Y1, Y2 depend only on X1, X2 respectively). So, because MI(X;Y)=MI(X1;Y1)+MI(X2;Y2)−MI(Y2;Y1), MI(X;Y) must be at least as large under Q as under P. In short: given any distribution, I can construct another distribution with at least as high information throughput, under which X1 and X2 are independent.
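The swap itself can also be spot-checked numerically. This sketch (again with made-up alphabet sizes, seed, and helper names) builds Q[X]=P[X1]P[X2] from a random P and checks that MI(X;Y) never comes out lower under Q than under P for random independent channels.

```python
# Illustrative check of the swap: replacing P[X1,X2] by Q[X1,X2] = P[X1]P[X2]
# never decreases MI(X;Y) when Y1 depends only on X1 and Y2 only on X2.
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

def mi_2d(pxy):
    """Mutual information (in nats) of a joint distribution laid out as a 2-D array."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px * py)[mask])).sum())

def throughput(p_x, ch1, ch2):
    """MI(X;Y) for input p_x[x1,x2] and channels ch1[x1,y1], ch2[x2,y2]."""
    joint = p_x[:, :, None, None] * ch1[:, None, :, None] * ch2[None, :, None, :]
    nx1, nx2, ny1, ny2 = joint.shape
    return mi_2d(joint.reshape(nx1 * nx2, ny1 * ny2))   # rows = (x1,x2), cols = (y1,y2)

for _ in range(1000):
    p_x = rng.random((3, 4)); p_x /= p_x.sum()                     # random correlated P[X]
    ch1 = rng.random((3, 3)); ch1 /= ch1.sum(axis=1, keepdims=True)
    ch2 = rng.random((4, 4)); ch2 /= ch2.sum(axis=1, keepdims=True)
    q_x = np.outer(p_x.sum(axis=1), p_x.sum(axis=0))               # Q[X] = P[X1] P[X2]
    assert throughput(q_x, ch1, ch2) >= throughput(p_x, ch1, ch2) - 1e-12
```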
Now let’s tackle our more general theorem, reusing some of the machinery above.
I’ll split Y into Y1 and Y2, and split X into X1−2 (parents of Y1 but not Y2), X2−1 (parents of Y2 but not Y1), and X1∩2 (parents of both). Then
MI(X;Y)=MI(X1∩2;Y)+MI(X1−2,X2−1;Y|X1∩2)
In analogy to the case above, we consider a distribution P[X] and construct a new distribution Q[X]:=P[X1∩2]P[X1−2|X1∩2]P[X2−1|X1∩2]. Compared to P, Q has the same value of MI(X1∩2;Y), and MI(X1−2,X2−1;Y|X1∩2) cannot be any lower under Q than under P, by exactly the same argument as in the independent case; we just repeat that argument with everything conditioned on X1∩2 throughout. So, given any distribution, I can construct another distribution with at least as high information throughput, under which X1−2 and X2−1 are independent given X1∩2.
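The conditional step, that MI(X1−2,X2−1;Y|X1∩2) is at least as large under Q as under P, can be spot-checked numerically. Below is a rough sketch of that check, writing A, B, C for X1−2, X2−1, and X1∩2; the alphabet sizes, random seed, and helper names are made-up illustrative choices.

```python
# Illustrative check of the conditional step: with channels P[y1|a,c] and P[y2|b,c],
# replacing P[A,B,C] by Q[A,B,C] = P[C]P[A|C]P[B|C] never decreases MI(A,B;Y|C).
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

def mi_2d(pxy):
    """Mutual information (in nats) of a joint distribution laid out as a 2-D array."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px * py)[mask])).sum())

def conditional_mi(p_abc, ch1, ch2):
    """MI(A,B ; Y1,Y2 | C) for input p_abc[a,b,c], channels ch1[a,c,y1], ch2[b,c,y2]."""
    na, nb, nc = p_abc.shape
    total = 0.0
    for c in range(nc):
        p_c = p_abc[:, :, c].sum()
        p_ab = p_abc[:, :, c] / p_c                          # P[a,b | C=c]
        # Joint P[a,b,y1,y2 | C=c] = P[a,b|c] P[y1|a,c] P[y2|b,c]
        joint = (p_ab[:, :, None, None]
                 * ch1[:, c, :][:, None, :, None]
                 * ch2[:, c, :][None, :, None, :])
        ny1, ny2 = joint.shape[2], joint.shape[3]
        total += p_c * mi_2d(joint.reshape(na * nb, ny1 * ny2))
    return total

for _ in range(500):
    na, nb, nc, ny1, ny2 = 2, 3, 2, 2, 3                     # made-up alphabet sizes
    p_abc = rng.random((na, nb, nc)); p_abc /= p_abc.sum()
    ch1 = rng.random((na, nc, ny1)); ch1 /= ch1.sum(axis=2, keepdims=True)
    ch2 = rng.random((nb, nc, ny2)); ch2 /= ch2.sum(axis=2, keepdims=True)

    # Q[a,b,c] = P[c]P[a|c]P[b|c]: same conditionals given C, but A and B
    # conditionally independent given C.
    p_c = p_abc.sum(axis=(0, 1))
    q_abc = (p_abc.sum(axis=1)[:, None, :] * p_abc.sum(axis=0)[None, :, :]) / p_c[None, None, :]

    assert conditional_mi(q_abc, ch1, ch2) >= conditional_mi(p_abc, ch1, ch2) - 1e-12
```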
Since this works for any Markov blanket X1∩2, there exists an information throughput maximizing distribution which satisfies the desired undirected graph.