# DanB comments on Utility Maximization = Description Length Minimization

• In my Phd thesis I explored an extension of the compression/​modeling equivalence that’s motivated by Algorithmic Information Theory. AIT says that if you have a “perfect” model of a data set, then the bitstream created by encoding the data using the model will be completely random. Every statistical test for randomness applied to the bitstream will return the expected value. For example, the proportion of 1s should be 0.5, the proportion of 1s following the prefix 010 should be 0.5, etc etc. Conversely, if you find a “randomness deficiency”, you have found a shortcoming of your model. And it turns out you can use this info to create an improved model.

That gives us an alternative conceptual approach to modeling/​optimization. Instead of maximizing a log-likelihood, take an initial model, encode the dataset, and then search the resulting bitstream for randomness deficiencies. This is very powerful because there is an infinite number of randomness tests that you can apply. Once you find a randomness deficiency, you can use it to create an improved model, and repeat the process until the bitstream appears completely random.

The key trick that made the idea practical is that you can use “pits” instead of bits. Bits are tricky, because as your model gets better, the number of bits goes down—that’s the whole point—so the relationship between bits and the original data samples gets murky. A “pit” is a [0,1) value calculated by applying the Probability Integral Transform to the data samples using the model. The same randomness requirements hold for the pitstream as for the bitstream, and there are always as many pits as data samples. So now you can define randomness tests based on intuitive contexts functions, like “how many pits are in the [0.2,0.4] interval when the previous word in the original text was a noun?”

• Interesting framing. Do you have a unified strategy for handling the dimensionality problem with sub-exponentially-large datasets, or is that handled mainly by the initial models (e.g. hidden markov, bigram, etc)?

• I’m not sure exactly what you mean, but I’ll guess you mean “how do you deal with the problem that there are an infinite number of tests for randomness that you could apply?”

I don’t have a principled answer. My practical answer is just to use good intuition and/​or taste to define a nice suite of tests, and then let the algorithm find the ones that show the biggest randomness deficiencies. There’s probably a better way to do this with differentiable programming—I finished my Phd in 2010, before the deep learning revolution.