The title, “Utility Maximization = Description Length Minimization”, and likewise the bolded statement, “to “optimize” a system is to reduce the number of bits required to represent the system state using a particular encoding”, strike me as wrong in the general case, or as only true in a degenerate sense that can’t imply much. This is unfortunate, because it inclines me to dismiss the rest of the post.
Suppose that the state of the world can be represented in 100 bits. Suppose my utility function assigns a 0 to each of 2^98 states (which I “hate”), and a 1 to all the remaining (2^100 − 2^98) states (which I “like”). Let’s imagine I chose those 2^98 states randomly, so there is no discernible pattern among them.
You would need about 99.58 bits to represent one state out of the states that I like, since log2(2^100 − 2^98) = log2(3 · 2^98) = 98 + log2(3) ≈ 99.58. So “optimizing” the world would mean reducing it from a 100-bit space to a 99.58-bit space (which you would probably end up encoding with 100 bits in practice). While it’s technically true that optimizing always implies shrinking the state space, the amount of shrinking can be arbitrarily tiny, and is not necessarily proportional to the amount by which the expected utility changes. Thus my objection to the title and early statement.
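For concreteness, the arithmetic in this example can be checked with a short Python sketch. The variable names are mine and purely illustrative; the only inputs are the 100-bit world and the 2^98 hated states from the example.

```python
import math

TOTAL_BITS = 100
total_states = 2 ** TOTAL_BITS
hated_states = 2 ** 98
liked_states = total_states - hated_states  # equals 3 * 2**98

# Bits needed to pick out one state among the liked ones.
bits_for_liked = math.log2(liked_states)  # about 99.585

# How much the description length shrinks by restricting to liked states.
bits_saved = TOTAL_BITS - bits_for_liked  # about 0.415, i.e. log2(4/3)

print(f"bits for liked states: {bits_for_liked:.3f}")
print(f"bits saved: {bits_saved:.3f}")
```

The saving is log2(4/3) bits regardless of how large the world is, because it depends only on the fraction of states that are liked.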
It probably is true in practice that most real utility functions are much more constraining than the above scenario. (For example, if you imagine all the possible configurations of the atoms that make up a human, only a tiny fraction of them correspond to a living human.) There might be interesting things to say about that. However, the post doesn’t seem to base its central arguments on that.
Given what is said later about using K-L divergence to decompose the problem into “reducing entropy” + “changing between similar-entropy distributions”, I could say that the post makes the case for me: that a more accurate title would be “Utility Maximization = Description Length Minimization + Other Changes” (I don’t have a good name for the second component).
I actually think this is a feature, not a bug. In your example, you like 3⁄4 of all possible world states. Satisfying your preferences requires shrinking the world-space by a relatively tiny amount, and that’s important. For instance:
From the perspective of another agent in the universe, satisfying your preferences is very likely to incur very little opportunity cost for the other agent (relevant e.g. when making deals)
From your own perspective, satisfying your preferences is “easy” and “doesn’t require optimizing very much”; you have a very large target to hit.
While it’s technically true that optimizing always implies shrinking the state space, the amount of shrinking can be arbitrarily tiny, and is not necessarily proportional to the amount by which the expected utility changes.
Remember that utility functions are defined only up to scaling and shifting. If you multiply a utility function by 0.00001, then it still represents the exact same preferences. There is not any meaningful sense in which utility changes are “large” or “small” in the first place, except compared to other changes in the same utility function.
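A minimal sketch of this point, using made-up outcomes and lotteries (every name and number below is hypothetical): multiplying a utility function by a positive constant and adding an offset leaves the ranking of lotteries by expected utility unchanged.

```python
def expected_utility(lottery, utility):
    # lottery maps outcome -> probability; utility maps outcome -> value.
    return sum(p * utility[outcome] for outcome, p in lottery.items())

def rank_lotteries(lotteries, utility):
    # Lottery names sorted from most to least preferred.
    return sorted(
        lotteries,
        key=lambda name: expected_utility(lotteries[name], utility),
        reverse=True,
    )

# Hypothetical utility function and a scaled-and-shifted copy of it.
utility = {"a": 0.0, "b": 1.0, "c": 3.0}
rescaled = {k: 0.00001 * v + 7.0 for k, v in utility.items()}

lotteries = {
    "safe": {"b": 1.0},
    "gamble": {"a": 0.5, "c": 0.5},
    "mixed": {"a": 0.2, "b": 0.5, "c": 0.3},
}

print(rank_lotteries(lotteries, utility))
print(rank_lotteries(lotteries, rescaled))
```

Both calls produce the same ordering, which is the sense in which the two functions represent the same preferences.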
On the other hand, optimization-as-compression does give us a meaningful sense in which changes are “large” or “small”.
There is not any meaningful sense in which utility changes are “large” or “small” in the first place, except compared to other changes in the same utility function.
We can establish a utility scale by tweaking the values a bit. Let’s say that in my favored 3⁄4 of the state space, half the values are 1 and the other half are 2. Then we can set the disfavored 1⁄4 to 0, to −100, to −10^100, etc., and get utility functions that aren’t equivalent. Anyway, in practice I expect we would already have some reasonable unit established by the problem’s background—for example, if the payoffs are given in terms of number of lives saved, or in units of “the cost of the action that ‘optimizes’ the situation”.
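To illustrate why those variants are not equivalent, here is a small sketch with a hypothetical pair of lotteries (the lotteries and the 90/10 split are my own assumptions, not part of the example above). Changing only the value of the disfavored states flips which lottery has the higher expected utility, which no scaling or shifting could do.

```python
def prefers_gamble(hated_value):
    # Favored states are worth 1 or 2; the disfavored states get hated_value.
    utility = {"liked_low": 1.0, "liked_high": 2.0, "hated": hated_value}

    # Hypothetical choice: a sure "liked_low" versus a 90% shot at
    # "liked_high" with a 10% risk of landing in a hated state.
    sure_thing = {"liked_low": 1.0}
    gamble = {"liked_high": 0.9, "hated": 0.1}

    def expected(lottery):
        return sum(p * utility[o] for o, p in lottery.items())

    return expected(gamble) > expected(sure_thing)

for hated_value in (0.0, -100.0, -1e100):
    print(hated_value, prefers_gamble(hated_value))
```

With the disfavored value at 0 the gamble wins (1.8 versus 1); at −100 or below it loses, so the three utility functions encode different preferences.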
Satisfying your preferences requires shrinking the world-space by a relatively tiny amount, and that’s important. [...] satisfying your preferences is “easy” and “doesn’t require optimizing very much”; you have a very large target to hit.
So the theory is that the factor by which you shrink the state space is proportional (or maybe its logarithm is proportional) to the effort involved. That might be a better heuristic than none at all, but it is by no means true in general. If we say I’m going to type 100 digits, and then I decide what those digits are and type them out, I’m shrinking the state-space by a factor of 10^100. If we say my net worth is between $0 and $10^12, and then I make my net worth be $10^12, I’m shrinking the state-space (in that formulation of the world) by a factor of only 10^12 (or perhaps 10^14 if cents are allowed); but the former is enormously easier for me to do than the latter. In practice, again, I think the problem’s background would give much better ways to estimate the cost of the “optimization” actions.
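Expressed in bits, the comparison looks like this. This is only a sketch of the arithmetic, using the state counts given above; the dictionary labels are my own.

```python
import math

# Number of equally weighted states each "optimization" rules out down to one.
state_counts = {
    "type 100 chosen digits": 10 ** 100,
    "net worth of $10^12 (whole dollars)": 10 ** 12,
    "net worth of $10^12 (cents allowed)": 10 ** 14,
}

# Shrinking N states to 1 corresponds to log2(N) bits of compression.
bits = {label: math.log2(n) for label, n in state_counts.items()}

for label, b in bits.items():
    print(f"{label}: {b:.1f} bits")
```

Typing the digits counts as roughly 332 bits of optimization and the net worth target as roughly 40 to 47 bits, which is the reverse of their ordering by difficulty.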
(Edit: If you want an entirely self-contained example, consider: A wall with 10 rows of 10 cubby-holes, and you have 10 heavy rocks. One person wants the rocks to fill out the bottom row, another wants them to fill out the left column, and a third wants them on the top row. At least if we consider the state space to just be the positions of the rocks, then each of these people wants the same amount of state-space shrinking, but they cost different amounts of physical work to arrange.)
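The cubby-hole example can be sketched numerically under some loudly stated assumptions of mine: the rocks are interchangeable, each hole holds at most one rock, all rocks start at the height of the bottom row, and lifting one rock up one row costs one unit of work.

```python
import math

ROWS, COLS, ROCKS = 10, 10, 10

# Assumption: a state is a choice of 10 occupied holes out of 100.
TOTAL_STATES = math.comb(ROWS * COLS, ROCKS)

# Each target is a set of (row, col) holes; row 0 is the bottom row.
targets = {
    "bottom row": {(0, c) for c in range(COLS)},
    "left column": {(r, 0) for r in range(ROWS)},
    "top row": {(ROWS - 1, c) for c in range(COLS)},
}

def shrinkage_bits(num_acceptable_states):
    # Bits of compression from narrowing all states to the acceptable ones.
    return math.log2(TOTAL_STATES) - math.log2(num_acceptable_states)

def lifting_work(target):
    # Assumption: one unit of work per rock per row climbed from the floor.
    return sum(row for row, _col in target)

for name, target in targets.items():
    # Each person accepts exactly one arrangement.
    print(name, round(shrinkage_bits(1), 2), lifting_work(target))
```

All three preferences get the same shrinkage in bits, while the work comes out as 0, 45, and 90 units respectively.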
I’m guessing that the best application of the idea would be as one of the basic first lenses you’d use to examine/classify a completely alien utility function.
If you want an entirely self-contained example, consider: A wall with 10 rows of 10 cubby-holes, and you have 10 heavy rocks. One person wants the rocks to fill out the bottom row, another wants them to fill out the left column, and a third wants them on the top row. At least if we consider the state space to just be the positions of the rocks, then each of these people wants the same amount of state-space shrinking, but they cost different amounts of physical work to arrange.
I think what’s really going on in this example (and probably in your intuitions about this more generally) is that we’re implicitly optimizing only one subsystem, and the “resources” (energy in this case, or money) are what “couple” optimization of this subsystem with optimization of the rest of the world.
Here’s what that means in the context of this example. Why is putting rocks on the top row “harder” than on the bottom row? Because it requires more work/energy expenditure. But why does energy matter in the first place? To phrase it more suggestively: why do we care about energy in the first place?
Well, we care about energy because it’s a limited and fungible resource. Limited: we only have so much of it. Fungible: we can expend energy to gain utility in many different ways in many different places/subsystems of the world. Putting rocks on the top row expends more energy, and that energy implicitly has an opportunity cost, since we could have used it to increase utility in some other subsystem.
More generally, the coherence theorems typically used to derive expected utility maximization implicitly assume that we have exactly this sort of resource. They use the resource as a “measuring stick”; if a system throws away the resource unnecessarily (i.e. ends up in a state which it could have gotten to with strictly less expenditure of the resource), then the system is suboptimal for any possible utility function for which the resource is limited and fungible.
Tying this all back to optimization-as-compression: we implicitly have an optimization constraint (i.e. the amount of resource) and likely a broader world in which the limited resource can be used. In order for optimization-as-compression to match intuitions on this problem, we need to include those elements. If energy can be expended elsewhere in the world to reduce description length of other subsystems, then there’s a similar implicit bit-length cost of placing rocks on the top row. (It’s conceptually very similar to thermodynamics: energy can be used to increase entropy in any subsystem, and temperature quantifies the entropy-cost of dumping energy into one subsystem rather than another.)
Hmm. If we bring actual thermodynamics into the picture, then I think that energy stored in some very usable way (say, a charged battery) has a small number of possible states, whereas when you expend it, it generally ends up as waste heat that has a lot of possible states. In that case, if someone wants to take a bunch of stored energy and spend it on, say, making a robot rotate a huge die made of rock into a certain orientation, then that actually leads to a larger state space than someone else’s preference to keep the energy where it is, even though we’d probably say that the former is costlier than the latter. We could also imagine a third person who prefers to spend the same amount of energy arranging 1000 smaller dice—same “cost”, but exponentially (in the mathematical sense) different state space shrinkage.
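A rough sketch of the dice comparison, counting only the orientation of each die and assuming (my simplification) that each die has 6 orientations of interest, one per face that can point up. Waste heat and the battery's own state are deliberately left out.

```python
import math

ORIENTATIONS_PER_DIE = 6  # assumption: only "which face is up" matters

def orientation_states(num_dice):
    # Number of joint orientations for num_dice independent dice.
    return ORIENTATIONS_PER_DIE ** num_dice

def shrinkage_bits(num_dice):
    # Bits of compression from fixing every die to one chosen orientation.
    return num_dice * math.log2(ORIENTATIONS_PER_DIE)

one_big_die = shrinkage_bits(1)
many_small_dice = shrinkage_bits(1000)

print(f"one huge die: {one_big_die:.2f} bits")
print(f"1000 small dice: {many_small_dice:.2f} bits")
```

The bit counts differ by a factor of 1000, so the state counts differ by a factor of 6^999, even though the energy spent is stipulated to be the same.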
It seems that, no matter how you conceptualize things, it’s fairly easy to construct a set of examples in which state space shrinkage bears little if any correlation to either “expected utility” or “cost”.