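In the utility maximization as description length minimization setting, we get something kinda weird:
Say that X, Y are random variables “of the same type” (that is, outcomes are in the same space), and say we fix a distribution M (corresponding to the utility function) over that space. Let $H_M(X) = \mathbb{E}[-\log M(X)]$ be the cross entropy of X relative to M, and likewise $H_M(Y)$. Let $\Delta H_M = H_M(X) - H_M(Y)$ and let $\Delta H_M^{\mathrm{blind}}$ be the largest $\Delta H_M$ achievable by any blind policy and input distribution.
Then you get the analogous version of Touchette-Lloyd… except, notice how the largest $\Delta H_M$ achievable by any policy at all is actually just equal to $\Delta H_M^{\mathrm{blind}}$? So you actually just get
$$\Delta H_M \le \Delta H_M^{\mathrm{blind}},$$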
that is, you can’t get more improvement than the best blind policy & input distribution gets‽
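For readers who want the link between a fixed distribution M and a utility function spelled out, here is a short worked step. It assumes a Boltzmann-style correspondence $M(x) = e^{u(x)}/Z$ with $Z = \sum_x e^{u(x)}$. The symbols $u$, $Z$ and $H_M$ are notation chosen for this sketch, with $H_M(X) = \mathbb{E}[-\log M(X)]$ the cross entropy of X relative to M.

```latex
\begin{align*}
H_M(X) &= \mathbb{E}\left[-\log M(X)\right] = -\mathbb{E}\left[u(X)\right] + \log Z, \\
H_M(X) - H_M(Y) &= \mathbb{E}\left[u(Y)\right] - \mathbb{E}\left[u(X)\right].
\end{align*}
```

Under that assumption the drop in cross entropy is exactly the gain in expected utility, since the $\log Z$ terms cancel. The gain is also linear in the joint distribution of inputs and actions, unlike an entropy difference.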
I sorta suspect it’s just because allowing the input distribution to vary in your maximum is allowing a lot of “cheating”.
As an example, take something like the password example from John’s “How many bits of optimization does one bit unlock” post. Here, the best blind policy is going to “cheat” by imagining that the password was some specific value, and then taking the action that gives that value to the “lock”. So it just says that you can’t do better than if you knew the password.
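A toy version of the “cheating” can be sketched in a few lines. The lock model here is an assumption made for illustration (a 3-bit password, and success means the guess equals the password); it is not taken from the post, and the names `success_prob`, `blind` and `informed` are hypothetical.

```python
import itertools

N_BITS = 3  # assumed password length for this toy lock
passwords = list(itertools.product([0, 1], repeat=N_BITS))

def success_prob(p_password, policy):
    """Probability that the policy's guess opens the lock.

    p_password maps each password to its probability;
    policy maps the true password to a guess.
    """
    return sum(p for pw, p in p_password.items() if policy(pw) == pw)

target = passwords[0]  # the value the blind policy "imagines"

def blind(pw):
    # Ignores the true password and always guesses the same value.
    return target

def informed(pw):
    # Reads the true password and repeats it.
    return pw

uniform = {pw: 1 / len(passwords) for pw in passwords}
point_mass = {pw: (1.0 if pw == target else 0.0) for pw in passwords}

blind_uniform = success_prob(uniform, blind)        # fixed input: blind rarely wins
informed_uniform = success_prob(uniform, informed)  # fixed input: information helps
blind_cheating = success_prob(point_mass, blind)    # chosen input: blind always wins
```

With the input distribution held at uniform, the blind policy succeeds with probability 1/8 and the informed one with probability 1. Once the maximum is also allowed to pick the input distribution, the blind policy reaches 1 as well, so the bound stops distinguishing the two.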
This sounds right. Since expected utility is linear, the expected utility of any policy is a weighted sum of the expected utilities of all possible (action, initial state) pairs. One of these pairs (call it (a_0, x_0)) has the highest expected change in utility after going through the dynamics, so you can put all of the input probability on x_0 and use the deterministic blind policy that always picks a_0. No other policy and input distribution can do better, since a weighted average can’t exceed its largest term. This isn’t true with entropy, since entropy is concave, not linear.
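The linearity argument can be checked numerically. This is a minimal sketch, assuming a finite set of actions and initial states and a made-up table `gain[a, x]` holding the expected change in utility for each (action, initial state) pair. The table, the sizes and the function name `expected_gain` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_states = 4, 5

# Hypothetical table: gain[a, x] is the expected change in utility when
# action a is taken from initial state x and the dynamics are applied.
gain = rng.normal(size=(n_actions, n_states))

def expected_gain(p_x, policy):
    """Expected change in utility; policy[a, x] = P(a | x), p_x[x] = P(x)."""
    return float(np.sum(policy * gain * p_x[None, :]))

# Best (action, state) pair, i.e. (a_0, x_0) in the text.
best_a, best_x = np.unravel_index(np.argmax(gain), gain.shape)

# Input distribution concentrated on x_0, plus a blind policy that
# always plays a_0 whatever the state is.
point_input = np.eye(n_states)[best_x]
blind_policy = np.zeros((n_actions, n_states))
blind_policy[best_a, :] = 1.0
blind_value = expected_gain(point_input, blind_policy)

# Compare against random input distributions and random policies that
# are allowed to depend on the state (each column sums to 1).
worst_gap = min(
    blind_value
    - expected_gain(
        rng.dirichlet(np.ones(n_states)),
        rng.dirichlet(np.ones(n_actions), size=n_states).T,
    )
    for _ in range(1000)
)
```

Here `worst_gap` never goes negative: every sampled state-dependent policy is a weighted average of entries of `gain`, so none of them beats the single largest entry.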
I encountered this issue when trying to prove an equivalent version of the TL theorem for utility maximization, but didn’t get beyond it. Of course, if you can’t choose the input distribution, then having mutual information with the input should still help you maximize your expected utility, but I couldn’t find an elegant/general way to express this fact!