This sounds right. Since expected utility is linear in the policy, the expected utility of any policy is a weighted average of the expected utilities of the individual (action, initial state) pairs. One of these pairs (call it (a_0, x_0)) has the highest expected change in utility after passing through the dynamics, so you can pick the initial input x_0 and use the deterministic blind policy that always plays a_0. By construction, this blind policy attains the highest possible expected change in utility. This isn’t true for entropy, since entropy is concave rather than linear: mixing distributions can strictly increase entropy, so a deterministic choice need not be optimal.
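A toy numeric check of both halves of the argument (the utility table, policy, and distributions here are all made up for illustration): any stochastic policy's expected utility is a convex combination of the entries of the utility table, so it can never beat the best single (action, state) pair, whereas for entropy a 50/50 mixture strictly beats both pure choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 actions, 2 initial states.
# U[a, x] = expected change in utility for the (action, initial state) pair.
U = rng.normal(size=(3, 2))

# Any stochastic policy pi over (a, x) pairs yields expected utility
# sum_{a,x} pi[a, x] * U[a, x] -- a convex combination of entries of U --
# so it can never exceed the best single entry.
a0, x0 = np.unravel_index(np.argmax(U), U.shape)
best_blind = U[a0, x0]  # deterministic blind policy: fix input x0, always play a0
pi = rng.dirichlet(np.ones(U.size)).reshape(U.shape)  # a random stochastic policy
assert pi.ravel() @ U.ravel() <= best_blind + 1e-12

# Entropy is concave, not linear: the entropy of a mixture can exceed the
# mixture of the entropies, so determinism loses its automatic optimality.
def H(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mix = 0.5 * p + 0.5 * q
print(H(p), H(q), H(mix))  # 0, 0, log 2 -- mixing strictly beats both pure choices
```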
I ran into this issue when trying to prove an analogue of the TL theorem for utility maximization, but didn’t get past it. Of course, if you can’t choose the input distribution, then having mutual information with the input should still help you maximize expected utility, but I couldn’t find an elegant or general way to express this fact!