Proof Explained: Touchette-Lloyd Theorem
This is a sequel to our previous post on the Touchette-Lloyd theorem[1]. The previous post contained some introductory material and motivation for the theorem. Here, we will walk through the proof of the theorem and explore its applications in a few worked examples. It isn’t strictly necessary to read that post before this one, but we recommend it if anything in this post seems unclear or if you would like some more background.
Recap and Setup
The theorem concerns three random variables. These are: the input ‘environment’ variable denoted by
The way in which the action variable
The theorem statement is the inequality
which relates the entropy change
The Proof
We will use the subscript
Let us consider the performance of a policy like this in producing a distribution with a low final entropy. In particular, we can quantify this using the difference between the initial environmental entropy
Quantifying entropy reduction in this way is intuitive and captures the fact that (other things being equal) it is more impressive for a agent or controller to achieve a low output entropy from a high starting entropy than to start with a low entropy and maintain it.
For any given initial distribution, there is a maximum
where
We can ask: is there a combination of initial distribution and sighted policy which can outperform this value of
For a general ‘sighted’ policy, the action
Notice that we can use Bayes rule to re-write the conditional probabilities relating
where
Note that this doesn’t imply that the causal relationship between
Substituting this back in and rewriting the order of things, we can write the output distribution for a general policy as
Notice that this is a weighted average over probability distributions for each
Interestingly, (and this is the key insight of the proof) we can think of this conditional output distribution for a sighted policy as the output distribution of a blind policy which always selects
Furthermore, since
Recall the definition of conditional entropy which allows us to write:
This allows us to simplify this equation and obtain
To turn this bound on conditional entropies into a bound on
Now, note two things. First, since
Second, note that
which is the result!
What does this result tell us?
From the point of view of the agent structure problem, we can view this result as a selection theorem which tells us that, if we select for (or observe) a policy which achieves an entropy reduction
Examples of different dynamics
To gain some intuition for this result, we will explore its implications through a few examples. These can have very different feels, but in all cases, the dynamics are fully specified by defining the conditional distribution
1. Entropy decreasing dynamics
First, we will consider an entropy decreasing dynamics. In this example the entropy of the environment will naturally tend to decrease, without help from the policy. A simple example of such a dynamics is one where all input states are mapped to the same output, regardless of the action taken by the policy. If we consider each state to be labelled by a 5-bit string the following probability distribution has this property:
If we only observed environmental states, we would see that regardless of the entropy of the initial distribution over input strings, the output distribution would have zero entropy. Naively, one might attribute this entropy reduction to an effective policy using information about its environment. But in fact, the entropy reduction is a property of the environmental dynamics, not the policy. This can be seen by the fact that no policy, blind or sighted, can achieve a greater entropy change than that achieved by a blind policy. For this dynamics,
In words: the output state is equal to the input state, unless the action taken is
2. Entropy non-increasing dynamics
Let’s return to the example given in the previous post, where the policy has to guess the correct 5-bit string. The dynamics of this example is deterministic and characterised by the following conditional probabilities
In words: the output state is
First, we will find
First note that if
Note also that the above argument can be extended to show that, for any initial distribution, the blind policy will always achieve a greater entropy reduction when the probabilities of 00000 and 11111 are equal compared to when they are not.
Now, notice that any input distribution can be viewed as a mixture between a distribution over the domain
where
where
Now let us consider the output entropy for this distribution, under a blind policy. With probability
By subtracting
The terms involving distributions over the domain excluding
Suppose we now observed a policy which achieved an output entropy of 2.83 bits when fed a maximum entropy input distribution with
The entropy reduction achieved by this policy is
this means that our policy must have at least
First, note that in the derivation of the inequality, we used the fact that
Since
In our case, while
So in fact, we can improve on the bound provided by the original Touchette-Lloyd inequality, provided that we are willing to replace
Another way to increase the lower bound on
3. Entropy increasing dynamics
We described the dynamics above as ‘entropy non-increasing’, since it is impossible for any policy to lead to an output distribution with a higher entropy than the input. We can also consider dynamics where entropy can increase and a policy has to be more proactive in fighting this entropy increase.
Consider the following dynamics, again in a situation where both environmental state and the action are described by 5-bit strings. If the action matches the environmental state (ie. if
Using a similar argument to the one in the previous section, we can find that the maximum blind policy entropy change is 0 bits, achieved when the initial distribution (and the corresponding action string) is completely concentrated on one environmental state. This is because any initial distribution which allows the policy to guess wrong will be punished by a high-entropy final state leading to a negative
Consider the sighted policy from the previous section which can ‘see’ the first four bits and chooses the action
Similarly to the previous section, we can improve on this bound by changing the initial distribution. If we consider the case where the initial distribution is over all strings which end in the bit ‘1’ then this sighted policy chooses the correct action every time and reduces the final entropy to zero. This corresponds to
Moving to an environment which was more hostile to entropy reduction leads to a greater discrepancy between blind and sighted policies, this in turn allows the bound on mutual information provided by the Touchette-Lloyd inequality to be made tighter than in the ‘entropy non-increasing’ environment of the previous example.
Incidentally, this environment also highlights why, when calculating
If the initial distribution has intermediate entropy (ie. nonzero but not maximal)
(NOTE: THIS INEQUALITY IS NOT TRUE!)
But we have just argued that a sighted policy can achieve
On the other hand, if we define
Conclusions & Further Work
We hope that the Touchette-Lloyd theorem is clear to you after reading this. We think that it is pretty interesting result from an Agent Foundations point of view and are interested in seeing if we can extend it. We have some ideas for a few avenues for further research.
Re-frame the result using expected utility. Reducing entropy and maximizing expected utility are (in some regards) similar in that they are both forms of optimization. If one policy achieves a particular expected utility (for some utility function over
) can we ascribe some lower bound on the mutual information between its actions and the environmental state?Extend the result to a non-Markovian environment. We assumed here that the environment was Markovian so that X could be treated as a random variable independent of its previous value. But most interesting environments do not have this property. Does a similar result hold when a policy must ‘remember’ previous environmental states to choose optimally at a given timestep?
Policies which ‘model’ the dynamics, not just the state. The Touchette-Lloyd theorem broadly tells us when a policy has ‘knowledge’ of the environmental state. But having mutual information between the state and does not necessarily imply successful entropy reduction. In order to successfully reduce entropy, one must also choose actions which are appropriate for the dynamics of the environment. Can we prove a similar which captures whether a policy is modelling the dynamics of the system?
Prove an equivalent result using K-complexity. Entropy reduction is one hallmark of an optimizer. But if a system reduces in complexity or description length, it could also reasonably be said to be ‘being optimized’. Can we prove a similar theorem involving the reduction of K-complexity, instead of entropy?
- ^
That is, theorem 10 from the paper Information-theoretic approach to the study of control systems.
- ^
In control theory terminology, these families of policies are known as ‘open loop’ and ‘closed loop’ respectively. But as a personal preference we find the names ‘blind’ and ‘sighted’ more evocative so we will use them instead.
- ^
Touchette and Lloyd describe this as “the fact that a closed-loop controller is formally equivalent to an ensemble of open-loop controllers acting on the conditional supports
instead of ” (where they are using to represent a specific controller action, equivalent to our )
In the utility maximization as description length minimization setting, we get something kinda weird:
be the cross entropy , and likewise . Let and let
is actually just equal to ? So you actually just get
Say that X, Y are random variables “of the same type” (that is, outcomes are in the same space), and say we fix a distribution M (corresponding to the utility function) over that space. Let
Then you get the analogous version of Touchette-Lloyd… except, notice how
that is, you can’t get more improvement than the best blind policy & input distribution gets‽
I sorta suspect it’s just because allowing the input distribution to vary in your maximum is allowing a lot of “cheating”.
As an example, take something like the password example from John’s “How many bits of optimization does one bit unlock” post. Here, the best blind policy is going to “cheat” by imagining that the password was some specific value, and then taking the action that gives that value to the “lock”. So it just says that you can’t do better than if you knew the password.
This sounds right. Since expected utility is linear, the expected utility of any policy will be a weighted sum of the expected utilities of all possible (action, initial state) pairs. One of these pairs (call it (a_0, x_0)) will have the highest expected change in utility after going through the dynamics so you can pick the initial input x_0 and have a deterministic blind policy of picking a_0. This will be a blind policy and by definition will have the highest possible change in utility. This isn’t true with entropy since entropy is convex, not linear.
I encountered this issue when trying to prove an equivalent version of the TL theorem for utility maximization, but didn’t get beyond it. Of course, if you can’t choose the input distribution, then having mutual information with the input should still help you maximize your expected utility, but I couldn’t find an elegant/general way to express this fact!
My naive attempt to make the algorithmic-information-theory variant of the theorem is (I’m pretty sure) false.
That is: I’m pretty sure there’s some f, A, X such that
Sketch: Given some special A’, X’, define f(X,A) = s if X,A=X’,A’ else X, for some easily compressible s; and require that K(A’) - K(A’|X’) > ϵ. Then if the inputs have less than epsilon of algorithmic mutual information, you’d know you didn’t have the special inputs, and so f would spit out X and thus not reduce the complexity.
If we chose our s so that K(s) < K(X’|A’), then we’ll get a counterexample.
I’m pretty sure you should be able to construct X’, A’ that satisfy those two inequalities, but I was struggling with some of the details.
Entropy reduction is always bounded by a deterministic policy’s entropy reduction.
In particular, entropy reduction for a blind policy is a convex function on the convex set ΔA of blind policies; so an extreme point will always be a maximum.
Likewise, we can get a weaker form of the bound: ΔH ≤ ΔH_f ≤ Δ H^{max}{blind} + H_f (A)
where f is an optimal (or at least better than the actual policy) sighted deterministic policy, and H_f means we take the entropy under the distribution using f as the policy.
This doesn’t seem to make the proofs easier, but it’s an interesting fact anyways