Eliminating G
The standard definition of channel capacity makes no explicit reference to the original message G; it can be eliminated from the problem. We can do the same thing here, but it’s trickier. First, let’s walk through it for the standard channel capacity setup.
Standard Channel Capacity Setup
In the standard setup, A cannot depend on O, so our graph looks like

[diagram: G → A → Y, with O as a second input to Y]

… and we can further remove O entirely by absorbing it into the stochasticity of Y.
Now, there are two key steps. First step: if A is not a deterministic function of G, then we can make A a deterministic function of G without reducing I(G;Y). Anywhere A is stochastic, we just read the random bits from some independent part of G instead; Y will have the same joint distribution with any parts of G which A was reading before, but will also potentially get some information about the newly-read bits of G as well.
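To make the derandomization step concrete, here is a small numerical sketch (my own construction, not from the original argument): a stochastic action A flips G1 with probability 1/4 using external noise, and the derandomized version instead reads that "noise" from two independent message bits G2, G3. With a noiseless channel Y = A for simplicity, I(G;Y) does not decrease — in fact it increases, since Y now also carries information about the newly-read bits:

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits, given a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Stochastic action: A = G1, flipped with probability 1/4 by EXTERNAL noise.
# The channel is noiseless (Y = A), and G1 is a fair coin.
joint_stoch = {}
for g1 in (0, 1):
    for a, p_a in ((g1, 0.75), (1 - g1, 0.25)):
        key = (g1, a)  # (message, Y)
        joint_stoch[key] = joint_stoch.get(key, 0.0) + 0.5 * p_a

# Derandomized action: read the "noise" from independent message bits G2, G3.
# A = G1 XOR (G2 AND G3) flips G1 with the same probability 1/4.
joint_det = {}
for g1 in (0, 1):
    for g2 in (0, 1):
        for g3 in (0, 1):
            a = g1 ^ (g2 & g3)
            key = ((g1, g2, g3), a)  # (full message, Y)
            joint_det[key] = joint_det.get(key, 0.0) + 1 / 8

i_stoch = mutual_information(joint_stoch)  # 1 - H(1/4) ~ 0.189 bits
i_det = mutual_information(joint_det)      # 1 bit: Y now also leaks G2 AND G3
assert i_det >= i_stoch
```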
Second step: note from the graphical structure that A mediates between G and Y. Since A is a deterministic function of G and A mediates between G and Y, we have I(G;Y)=I(A;Y).
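A quick numerical check of that identity, with a toy distribution of my own choosing (G uniform on four values, A = parity of G so A is deterministic in G, and Y a noisy copy of A so A mediates between G and Y):

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits, given a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# G uniform on {0, 1, 2, 3}; A = G mod 2 is a deterministic function of G;
# Y = A flipped with probability 0.1, so A mediates between G and Y.
flip = 0.1
joint_gy, joint_ay = {}, {}
for g in range(4):
    a = g % 2
    for y, p_y in ((a, 1 - flip), (1 - a, flip)):
        joint_gy[(g, y)] = joint_gy.get((g, y), 0.0) + 0.25 * p_y
        joint_ay[(a, y)] = joint_ay.get((a, y), 0.0) + 0.25 * p_y

# I(G;Y) = I(A;Y), as claimed.
assert abs(mutual_information(joint_gy) - mutual_information(joint_ay)) < 1e-9
```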
Furthermore, we can achieve any distribution P[A] (to arbitrary precision) by choosing a suitable function A(G).
So, for the standard channel capacity problem, we have $P[G,A,Y] = P[G]\,P[A|G]\,P[Y|A]$, and we can simplify the optimization problem:

$$\max_{P[A|G]} I(G;Y) = \max_{P[A]} I(A;Y)$$
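As a sanity check of this equality, here is a brute-force sketch under toy assumptions of my own (G uniform on four values, a binary symmetric channel with flip probability 0.1): the left side maximizes over deterministic encoders A(G), the right side over a grid of distributions P[A], and both find the same maximum:

```python
import itertools
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits, given a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

FLIP = 0.1  # binary symmetric channel P[Y|A]

def channel_mi(q):
    """I(A;Y) when P[A=1] = q and Y is A flipped with probability FLIP."""
    joint = {}
    for a, p_a in ((0, 1 - q), (1, q)):
        for y, p_y in ((a, 1 - FLIP), (1 - a, FLIP)):
            joint[(a, y)] = joint.get((a, y), 0.0) + p_a * p_y
    return mutual_information(joint)

# Left side: max over deterministic encoders f: {0,1,2,3} -> {0,1}, G uniform.
# Since A = f(G) is deterministic in G and mediates, I(G;Y) = I(A;Y),
# and the induced distribution has P[A=1] = (number of ones in f) / 4.
left = max(channel_mi(sum(f) / 4) for f in itertools.product((0, 1), repeat=4))

# Right side: max over a fine grid of distributions P[A=1] = q.
right = max(channel_mi(k / 200) for k in range(201))

assert abs(left - right) < 1e-9  # both attain capacity at q = 1/2
```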
Note that this all applies directly to our conjecture, for the part where actions do not depend on observations.
That’s how we get the standard expression for channel capacity. It would potentially be helpful to do something similar in our problem, where A is allowed to observe O.
Our Problem
The step about determinism of A carries over easily: if A is not a deterministic function of G and O, then we can change A to read random bits from an independent part of G. That will make A a deterministic function of G and O without reducing I(G;Y).
The second step fails: A does not mediate between G and Y.
However, we can define a “Policy” variable
$$\pi := (o \mapsto A(G, o))$$
π is also a deterministic function of G, and π does mediate between G and Y. And we can achieve any distribution over policies (to arbitrary precision) by choosing a suitable function A(G,O).
So, we can rewrite our problem as
$$\max_{P[A|G,O]} I(G;Y) = \max_{P[\pi]} I(\pi;Y)$$
In the context of our toy example: π has two possible values, (o ↦ o) and (o ↦ ¬o). If π takes the first value, then Y is deterministically 0; if π takes the second value, then Y is deterministically 1. So, taking the distribution P[π] to be 50/50 over those two values, our generalized “channel capacity” is at least 1 bit. (Note that we haven’t shown that no P[π] achieves a higher value in the maximization problem, which is why I say “at least”.)
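Assuming the toy example is the XOR channel — O a fair coin and Y = A ⊕ O, which matches the behavior described here (the identity policy forces Y = 0, the negation policy forces Y = 1) — the 1-bit claim checks out numerically:

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits, given a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Policies for the toy example, encoded as (action when o=0, action when o=1).
identity = (0, 1)   # o -> o
negation = (1, 0)   # o -> not o

# P[pi] is 50/50 over the two policies; O is a fair coin; Y = pi(O) XOR O.
joint = {}
for pi in (identity, negation):
    for o in (0, 1):
        y = pi[o] ^ o
        joint[(pi, y)] = joint.get((pi, y), 0.0) + 0.5 * 0.5

assert abs(mutual_information(joint) - 1.0) < 1e-9  # exactly 1 bit
```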
Back to the general case: our conjecture can be expressed as
$$\Delta = \max_{P[\pi]} I(\pi;Y) \,-\, \max_{P[A]} I(A;Y) \,\le\, H(O)$$
where the first optimization problem uses the factorization
$$P[\pi,O,Y] = P[\pi]\,P[O]\,P[Y|A{=}\pi(O),O]$$
and the second optimization problem uses the factorization
$$P[A,O,Y] = P[A]\,P[O]\,P[Y|A,O]$$
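Putting the two factorizations side by side for the toy example (again assuming O is a fair coin and Y = A ⊕ O), a brute-force sketch confirms Δ = 1 bit = H(O), so the conjectured bound is tight here:

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits, given a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# First maximization: policies see O. Y = pi(O) XOR O, with O a fair coin.
# A 50/50 mix of the identity and negation policies already gives 1 bit,
# and Y is binary, so I(pi;Y) <= H(Y) <= 1 bit: the max is exactly 1.
joint_pi = {}
for pi in ((0, 1), (1, 0)):          # (action at o=0, action at o=1)
    for o in (0, 1):
        y = pi[o] ^ o
        joint_pi[(pi, y)] = joint_pi.get((pi, y), 0.0) + 0.25
max_with_obs = mutual_information(joint_pi)  # 1.0

# Second maximization: A is chosen without seeing O. For any P[A=1] = q,
# Y = A XOR O is a fair coin independent of A, so I(A;Y) = 0.
max_without_obs = 0.0
for k in range(101):
    q = k / 100
    joint_a = {}
    for a, p_a in ((0, 1 - q), (1, q)):
        for o in (0, 1):
            joint_a[(a, a ^ o)] = joint_a.get((a, a ^ o), 0.0) + p_a * 0.5
    max_without_obs = max(max_without_obs, mutual_information(joint_a))

delta = max_with_obs - max_without_obs
assert abs(delta - 1.0) < 1e-9   # Delta = 1 bit = H(O) for a fair coin
```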