Does The Information-Throughput-Maximizing Input Distribution To A Sparsely-Connected Channel Satisfy An Undirected Graphical Model?
[EDIT: Never mind, proved it.]
Suppose I have an information channel X→Y. The X components X1,...,Xm and the Y components Y1,...,Yn are sparsely connected, i.e. the typical Yi is downstream of only a few parent X-components X_pa(i). (Mathematically, that means the channel factors as P[Y|X] = ∏_i P[Yi|X_pa(i)].)
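For concreteness, here is what that factorization looks like numerically in a toy instance: three binary X-components, two binary Y-components, with pa(1) = {1,3} and pa(2) = {2,3}. (A minimal numpy sketch; the shapes, parent sets, and variable names are made up purely for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse channel: X = (X1, X2, X3), Y = (Y1, Y2), all binary,
# with Y1 depending only on (X1, X3) and Y2 depending only on (X2, X3).
P_Y1 = rng.random((2, 2, 2))                    # table for P[Y1 | X1, X3], indexed [x1, x3, y1]
P_Y1 /= P_Y1.sum(axis=-1, keepdims=True)
P_Y2 = rng.random((2, 2, 2))                    # table for P[Y2 | X2, X3], indexed [x2, x3, y2]
P_Y2 /= P_Y2.sum(axis=-1, keepdims=True)

# Full channel P[Y|X] = P[Y1|X1,X3] * P[Y2|X2,X3], assembled as a matrix whose rows
# range over joint inputs (x1, x2, x3) and whose columns range over joint outputs (y1, y2).
W = np.einsum('ace,bcf->abcef', P_Y1, P_Y2).reshape(8, 4)
assert np.allclose(W.sum(axis=1), 1.0)          # each row is a conditional distribution over Y
```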
Now, suppose I split the Y components into two sets, and hold constant any X-components which are upstream of components in both sets. Conditional on those (relatively few) X-components, our channel splits into two independent channels.
E.g. in the image above, if I hold X4 constant, then I have two independent channels: (X1,X2,X3)→(Y1,Y2,Y3,Y4) and (X5,X6,X7)→(Y5,Y6,Y7,Y8).
Now, the information-throughput-maximizing input distribution for a pair of independent channels is just the product of the throughput-maximizing distributions for the two channels individually. In other words: for independent channels, we get an independent throughput-maximizing distribution.
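(As a quick numerical sanity check of that standard fact, one can run Blahut-Arimoto on two small random channels and on their product channel and compare. A rough sketch, with arbitrary alphabet sizes and a fixed iteration count rather than a proper convergence test:)

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_info(p_x, W):
    """I(X;Y) in nats, for input distribution p_x and channel matrix W[x, y]."""
    q_y = p_x @ W
    ratio = np.where(W > 0, W / q_y, 1.0)        # zero-probability entries contribute nothing
    return float(np.sum(p_x[:, None] * W * np.log(ratio)))

def blahut_arimoto(W, iters=2000):
    """Standard Blahut-Arimoto iteration toward a capacity-achieving input distribution."""
    p = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(iters):
        q = p @ W
        d = np.sum(W * np.log(W / q), axis=1)    # KL(W[x,:] || q) for each input x
        p *= np.exp(d)
        p /= p.sum()
    return p, mutual_info(p, W)

def random_channel(nx, ny):
    W = rng.random((nx, ny))
    return W / W.sum(axis=1, keepdims=True)

W1 = random_channel(3, 4)                                 # channel X1 -> Y1
W2 = random_channel(2, 3)                                 # channel X2 -> Y2
W12 = np.einsum('ac,bd->abcd', W1, W2).reshape(6, 12)     # joint channel (X1,X2) -> (Y1,Y2)

p1, C1 = blahut_arimoto(W1)
p2, C2 = blahut_arimoto(W2)
_, C12 = blahut_arimoto(W12)
print(C1 + C2, C12)                                       # should agree up to convergence error
print(mutual_info(np.outer(p1, p2).reshape(-1), W12))     # product of the individual optima gives C1 + C2
```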
So it seems like a natural guess that something similar would happen in our sparse setup.
Conjecture: The throughput-maximizing distribution for our sparse setup is independent conditional on overlapping X-components. E.g. in the example above, we’d guess that P[X]=P[X4]P[X1,X2,X3|X4]P[X5,X6,X7|X4] for the throughput maximizing distribution.
If that’s true in general, then we can apply it to any Markov blanket in our sparse channel setup, so it implies that P[X] factors over any set of X components which is a Markov blanket splitting the original channel graph. In other words: it would imply that the throughput-maximizing distribution satisfies an undirected graphical model, in which two X-components share an edge if-and-only-if they share a child Y-component.
It’s not obvious that this works mathematically; information throughput maximization (i.e. the optimization problem by which one computes channel capacity) involves some annoying coupling between terms. But it makes sense intuitively. I’ve spent less than an hour trying to prove it and mostly found it mildly annoying though not clearly intractable. Seems like the sort of thing where either (a) someone has already proved it, or (b) someone more intimately familiar with channel capacity problems than I am could easily prove it.
So: anybody know of an existing proof (or know that the conjecture is false), or find this conjecture easy to prove themselves?
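For what it's worth, here is one way to poke at the conjecture numerically on a tiny example: generate a random sparse channel with X = (X1, X2, X3) and Y = (Y1, Y2) (X3 being the overlapping component), run Blahut-Arimoto over the joint input alphabet, then build the factored candidate Q[X] = P[X3]P[X1|X3]P[X2|X3] from the resulting marginals and compare throughput. (A rough sketch, assuming binary variables and a fixed iteration count; suggestive at best, not a substitute for a proof.)

```python
import numpy as np

rng = np.random.default_rng(1)

def mutual_info(p_x, W):
    """I(X;Y) in nats, for input distribution p_x (flat vector) and channel matrix W[x, y]."""
    q_y = p_x @ W
    ratio = np.where(W > 0, W / q_y, 1.0)
    return float(np.sum(p_x[:, None] * W * np.log(ratio)))

def blahut_arimoto(W, iters=5000):
    """Standard Blahut-Arimoto iteration toward a capacity-achieving input distribution."""
    p = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(iters):
        q = p @ W
        d = np.sum(W * np.log(W / q), axis=1)    # KL(W[x,:] || q) for each input x
        p *= np.exp(d)
        p /= p.sum()
    return p

# Toy sparse channel: X = (X1, X2, X3), Y = (Y1, Y2), all binary,
# with P[Y|X] = P[Y1|X1,X3] * P[Y2|X2,X3]  (X3 is the overlapping component).
A = rng.random((2, 2, 2)); A /= A.sum(axis=-1, keepdims=True)   # P[Y1 | x1, x3]
B = rng.random((2, 2, 2)); B /= B.sum(axis=-1, keepdims=True)   # P[Y2 | x2, x3]
W = np.einsum('ace,bcf->abcef', A, B).reshape(8, 4)             # rows: (x1,x2,x3), cols: (y1,y2)

P = blahut_arimoto(W).reshape(2, 2, 2)      # (near-)optimal joint input distribution P[x1, x2, x3]

# Factored candidate Q[X] = P[X3] P[X1|X3] P[X2|X3], built from P's marginals;
# by construction Q makes X1 and X2 independent given X3.
P3  = P.sum(axis=(0, 1))                    # P[x3]
P13 = P.sum(axis=1)                         # P[x1, x3]
P23 = P.sum(axis=0)                         # P[x2, x3]
Q = P13[:, None, :] * P23[None, :, :] / P3

print("throughput of Blahut-Arimoto solution:", mutual_info(P.reshape(-1), W))
print("throughput of factored candidate Q:   ", mutual_info(Q.reshape(-1), W))
```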
Proof
Specifically, we’ll show that there exists an information throughput maximizing distribution which satisfies the undirected graph. We will not show that all optimal distributions satisfy the undirected graph, because that’s false in some trivial cases—e.g. if all the Y’s are completely independent of X, then all distributions are optimal. We will also not show that all optimal distributions factor over the undirected graph, which is importantly different because of the P[X]>0 caveat in the Hammersley-Clifford theorem.
First, we’ll prove the (already known) fact that an independent distribution P[X]=P[X1]P[X2] is optimal for a pair of independent channels (X1→Y1,X2→Y2); we’ll prove it in a way which will play well with the proof of our more general theorem. Using standard information identities plus the factorization structure Y1−X1−X2−Y2 (that’s a Markov chain, not subtraction), we get
MI(X;Y) = MI(X;Y1) + MI(X;Y2|Y1)
= MI(X;Y1) + (MI(X;Y2) − MI(Y2;Y1) + MI(Y2;Y1|X))
= MI(X1;Y1) + MI(X2;Y2) − MI(Y2;Y1)
Now, suppose you hand me some supposedly-optimal distribution P[X]. From P, I construct a new distribution Q[X] := P[X1]P[X2]. Note that MI(X1;Y1) and MI(X2;Y2) are both the same under Q as under P, while MI(Y2;Y1) is zero under Q. So, because MI(X;Y) = MI(X1;Y1) + MI(X2;Y2) − MI(Y2;Y1), the MI(X;Y) must be at least as large under Q as under P. In short: given any distribution, I can construct another distribution with at least as high information throughput, under which X1 and X2 are independent.
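A quick numerical check of both the identity and this construction, on a pair of random independent channels fed a random correlated input (a sketch; the alphabet sizes and names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

def mi_from_joint(pxy):
    """Mutual information (in nats) between the row variable and the column variable of a joint table."""
    pa = pxy.sum(axis=1, keepdims=True)
    pb = pxy.sum(axis=0, keepdims=True)
    ratio = np.where(pxy > 0, pxy / (pa * pb), 1.0)
    return float(np.sum(pxy * np.log(ratio)))

# Independent channels W1 = P[Y1|X1], W2 = P[Y2|X2].
W1 = rng.random((3, 3)); W1 /= W1.sum(axis=1, keepdims=True)
W2 = rng.random((2, 4)); W2 /= W2.sum(axis=1, keepdims=True)

# An arbitrary (correlated) input distribution P[X1, X2].
P = rng.random((3, 2)); P /= P.sum()

def check(P):
    # Joint over (X1, X2, Y1, Y2): P[x1,x2] * W1[x1,y1] * W2[x2,y2].
    J = np.einsum('ab,ac,bd->abcd', P, W1, W2)
    I_X_Y   = mi_from_joint(J.reshape(6, 12))     # I(X;Y) with X=(X1,X2), Y=(Y1,Y2)
    I_X1_Y1 = mi_from_joint(J.sum(axis=(1, 3)))   # I(X1;Y1)
    I_X2_Y2 = mi_from_joint(J.sum(axis=(0, 2)))   # I(X2;Y2)
    I_Y1_Y2 = mi_from_joint(J.sum(axis=(0, 1)))   # I(Y1;Y2)
    return I_X_Y, I_X1_Y1 + I_X2_Y2 - I_Y1_Y2

print(check(P))                                   # the two numbers should match (the identity above)
Q = np.outer(P.sum(axis=1), P.sum(axis=0))        # Q[X] := P[X1] P[X2]
print(check(Q))                                   # throughput at least as high as under P; here I(Y1;Y2) = 0
```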
Now let’s tackle our more general theorem, reusing some of the machinery above.
I’ll split Y into Y1 and Y2, and split X into X1−2 (parents of Y1 but not Y2), X2−1 (parents of Y2 but not Y1), and X1∩2 (parents of both). Then
MI(X;Y)=MI(X1∩2;Y)+MI(X1−2,X2−1;Y|X1∩2)
In analogy to the case above, we consider a distribution P[X], and construct a new distribution Q[X] := P[X1∩2]P[X1−2|X1∩2]P[X2−1|X1∩2]. Compared to P, Q has the same value of MI(X1∩2;Y), and by exactly the same argument as the independent case, MI(X1−2,X2−1;Y|X1∩2) cannot be any lower under Q than under P; we just repeat the same argument with everything conditional on X1∩2 throughout. So, given any distribution, I can construct another distribution with at least as high information throughput, under which X1−2 and X2−1 are independent given X1∩2.
Since this works for any Markov blanket X1∩2, there exists an information-throughput-maximizing distribution which satisfies the desired undirected graph.
I suppose another way to look at this is that the overlapping components are the blanket states in some kind of time-dependent Markov blanket setup, right?
In the scenario you created, you could treat x1,x2,x3 as the shielded state at time step t, i.e. i_t. Then x5,x6,x7 are the states outside of the blanket, i.e. e_t (which group of states is i and which is e don’t really matter, so long as they are on either side of the blanket). y1,y2,y3,y4[1] become i_{t+1}, and y5,y6,y7,y8 become e_{t+1}.
Then x4 becomes the blanket b_t such that
I(i_{t+1}; e_{t+1} | b_t) ≈ 0
and
P(i_{t+1}, e_{t+1} | i_t, e_t, b_t) = P(i_{t+1} | i_t, b_t) ⋅ P(e_{t+1} | e_t, b_t)
With all that implies. In fact you can just as easily have three shielded states, or four, using this formulation.
(the setup for this is shamelessly ripped off from @Gunnar_Zarncke’s unsupervised agent detection work)
[1] Did you miss an arrow going to y4?
(Was in the middle of writing a proof before noticing you did it already)
I believe the end result is that if we have Y=(Y1,Y2), X=(X1,X2,X3) with P(Y|X)=P(Y1|X1,X3)P(Y2|X2,X3) (X1 upstream of Y1, X2 upstream of Y2, X3 upstream of both),
then maximizing I(X;Y) is equivalent to maximizing I(Y1;X1,X3)+I(Y2;X2,X3)−I(Y1;Y2).
& for the proof we can basically replicate the proof for additivity, except substituting the factorization P(X1,X2,X3)=P(X3)P(X1|X3)P(X2|X3) as the assumption in place of independence; then both directions of the inequality will result in I(Y1;X1,X3)+I(Y2;X2,X3)−I(Y1;Y2).
[EDIT: Forgot −I(Y1;Y2) term due to marginal dependence P(Y1,Y2)≠P(Y1)P(Y2)]
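(In case it's useful to anyone checking this: the decomposition I(X;Y) = I(Y1;X1,X3) + I(Y2;X2,X3) − I(Y1;Y2) holds for this channel structure under an arbitrary input distribution, factored or not, so it is easy to verify numerically. A small sketch with made-up binary alphabets:)

```python
import numpy as np

rng = np.random.default_rng(3)

def mi(joint):
    """Mutual information (in nats) between the row variable and the column variable of a 2-D joint table."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    ratio = np.where(joint > 0, joint / (pa * pb), 1.0)
    return float(np.sum(joint * np.log(ratio)))

# Channel P(Y|X) = P(Y1|X1,X3) P(Y2|X2,X3), everything binary; arbitrary (non-factored) P(X).
A = rng.random((2, 2, 2)); A /= A.sum(axis=-1, keepdims=True)   # P(Y1 | x1, x3)
B = rng.random((2, 2, 2)); B /= B.sum(axis=-1, keepdims=True)   # P(Y2 | x2, x3)
P = rng.random((2, 2, 2)); P /= P.sum()                         # P(x1, x2, x3)

# Full joint over (X1, X2, X3, Y1, Y2).
J = np.einsum('abc,ace,bcf->abcef', P, A, B)

lhs       = mi(J.reshape(8, 4))                   # I(X; Y)
I_Y1_X1X3 = mi(J.sum(axis=(1, 4)).reshape(4, 2))  # I(Y1; X1, X3)
I_Y2_X2X3 = mi(J.sum(axis=(0, 3)).reshape(4, 2))  # I(Y2; X2, X3)
I_Y1_Y2   = mi(J.sum(axis=(0, 1, 2)))             # I(Y1; Y2)
print(lhs, I_Y1_X1X3 + I_Y2_X2X3 - I_Y1_Y2)       # these should agree for any P(X)
```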