Significance of Compression Rate Method

Summary: The significance of the Compression Rate Method (CRM) is that it justifies a form of empirical inquiry into aspects of reality that have previously resisted systematic interrogation. Some examples of potential investigations are described. A key hypothesis is discussed, and the link between empirical science and lossless data compression is emphasized.

In my previous post, the protagonist Sophie developed a modified version of the scientific method. It consists of the following steps:

  1. Obtain a large database T related to a phenomenon of interest.

  2. Develop a theory of the phenomenon, and instantiate the theory as a compression program.

  3. Test the theory by invoking the compressor on T and measuring the net codelength achieved (encoded data plus length of compressor).

  4. Given two rival theories of the phenomenon, prefer the one that achieves the shorter net codelength (a minimal sketch of this comparison appears after the list).

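To make steps 3 and 4 concrete, here is a minimal sketch of how two rival theories would be scored, assuming each theory has already been instantiated as a compressor program and run on the target database T. The file names (theory_a.py, T.encoded_by_a, and so on) are purely illustrative, not part of the method itself.

    import os

    def net_codelength(compressor_path, encoded_path):
        """Net codelength in bits: size of the encoded data plus the
        size of the compressor program that produced it."""
        return 8 * (os.path.getsize(encoded_path) + os.path.getsize(compressor_path))

    # Hypothetical artifacts: each rival theory has been packaged as a
    # compressor script and already invoked on the database T.
    bits_a = net_codelength("theory_a.py", "T.encoded_by_a")
    bits_b = net_codelength("theory_b.py", "T.encoded_by_b")

    # Step 4: prefer the theory that achieves the shorter net codelength.
    winner = "theory A" if bits_a < bits_b else "theory B"
    print(f"A: {bits_a} bits, B: {bits_b} bits -> prefer {winner}")
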
This modified version preserves two of the essential attributes of the traditional method. First, it employs theoretical speculation, but guides and constrains that speculation using empirical observations. Second, it permits Strong Inference by allowing the field to make decisive comparisons between rival theories.

The key difference between the CRM and the traditional method is that the former does not depend on the use of controlled experiments. For that reason, it justifies inquiries into aspects of empirical reality that have never before been systematically interrogated. The kind of scientific theory tested by the CRM depends on the type of measurements in the target database T. If T contains measurements from physics experiments, theories of physics will be necessary to compress it. Other types of data lead to other types of science. Consider the following examples:

  1. Set up a camera next to a highway, and record the stream of passing cars. To compress the resulting data, you will need to develop a computational understanding of the visual appearance of automobiles. You will need theories of hubcaps, windshields, license plates, car categories, and so on.

  2. Position some microphones in the tops of trees and start recording. A major source of variation in the resulting data will be bird vocalization. To compress the data, you will need to find ways to differentiate between bird songs and bird calls, tools to identify species-characteristic vocalizations, and maps showing the typical ranges of various species. In other words, this type of inquiry will be a computational version of the traditional study of bird vocalization carried out by ornithologists.

  3. Construct a database from large quantities of English text. To compress this database you will need an advanced computational understanding of English. You will need dictionaries, rules of grammar, word-sense disambiguation tools, and, more generally, theories of linguistics.

  4. Convince Mark Zuckerberg to give you the Facebook image database. One obvious property of this dataset is that it contains an enormous number of faces. To compress it, you will need theories of the appearance of faces. These theories will be highly related to work on face modeling in graphics—see here for example.

  5. Generate a huge database of economic data such as home prices, interest and exchange rate fluctuations, business inventories and sales, unemployment and welfare applications, and so on. To compress this database, you will need theories of economics.

It should be emphasized that wherever the above list says “you will need theories of X”, this simultaneously means “you can test and refine theories of X” and “you can prove the superiority of your pet theory of X” by demonstrating the codelengths it achieves on an appropriate dataset. So if you are a linguist and you want to demonstrate the validity of X-bar theory, you build an X-bar compressor and test it on the large text database. If you are an economist and you want to prove the truth of Austrian Business Cycle Theory, you build ABCT into a compressor and invoke it on the economics database. If a theory can’t be packaged into a compressor for some real-world dataset, then it’s probably not scientific anyway (more later on the problem of demarcation).

(It’s also worth noting the dedication to truth, and the simultaneous contempt for petty academic affiliation games, indicated by a rigorous adherence to the compression principle. If you develop a new theory of linguistics and use it to set a new record on the benchmark text database, I will hail you as a great linguist. I will publish your papers in my journal, nominate you for awards, and approve your grant applications. It does not matter if you are a teenage college dropout living with your parents.)

The inquiries described in the above list make an important implicit assumption, which can be called the Reusability Hypothesis:

The abstractions useful for practical applications are also useful for compression.

One very practical application related to the Facebook database is the detection and recognition of faces. This application depends on the existence of a “face” abstraction, so the hypothesis implies that this abstraction will be useful for compression as well. Similarly, with regard to the ornithology example, one can imagine that the ability to recognize bird song would be very useful to bird-watchers and environmentalists, who might want to monitor the activity, population fluctuations, and migration patterns of certain species. Here the Reusability Hypothesis implies that the ability to recognize bird song will also be useful for compressing the treetop sound database.

The linguistics example is worth examining because of its connection to a point made in the preface about overmathematization and the distinction between complex deduction and complex induction. The field of computational linguistics is highly mathematized, and one can imagine that in principle some complex mathematics might be useful for achieving text compression. But by far the simplest, and probably the most powerful, tool for text compression is just a dictionary. Consider the following sentence:

John went to the liquor store and bought a bottle of _______ .

Now, if a compressor knows nothing about English text, it will have to encode the new word letter by letter, for a cost of N·log(26) bits, where N is the length of the word (I assume N is encoded separately, at a basically fixed cost). But a compressor equipped with a dictionary will have to pay only log(W_N) bits, where W_N is the number of words of length N in the dictionary. This is a substantial savings, since W_N is much smaller than 26^N. Of course, more advanced techniques, such as methods that take into account part-of-speech information, will lead to further improvement.
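
For concreteness, here is the back-of-the-envelope version of that savings in Python. The figure W_N = 5,000 for seven-letter words (say, “bourbon”) is an assumed, illustrative number, not a real dictionary count.

    from math import log2

    N = 7          # length of the unknown word, e.g. "bourbon"
    W_N = 5000     # assumed number of seven-letter words in the dictionary

    letter_by_letter = N * log2(26)   # roughly 32.9 bits
    with_dictionary = log2(W_N)       # roughly 12.3 bits

    print(f"letter by letter: {letter_by_letter:.1f} bits")
    print(f"with dictionary:  {with_dictionary:.1f} bits")
    print(f"savings:          {letter_by_letter - with_dictionary:.1f} bits")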

The point of the above example is that the dictionary is highly useful, but it does not involve any kind of complex mathematics. Compare a dictionary to the theory of general relativity. Both can be used to make predictions, and so both should be viewed (under my definition) as legitimate scientific theories. And they are both complex, but complex in opposite ways. GR is deductively complex, since it requires sophisticated mathematics to use correctly, but inductively simple, because it requires only a few parameters to specify. In contrast, the dictionary is deductively simple, since it can be used by anyone who can read, but inductively complex, since it requires many bits to specify.
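
To put rough numbers on the two kinds of complexity, here is an order-of-magnitude estimate; the dictionary size, average word length, and parameter precision are all assumptions chosen for illustration.

    from math import log2

    # The dictionary: deductively simple, inductively complex.
    num_words = 100_000        # assumed dictionary size
    avg_word_length = 8        # assumed average word length, in letters
    dictionary_bits = num_words * avg_word_length * log2(26)   # ~3.8 million bits

    # General relativity: deductively complex, inductively simple.
    # Assume its free content is about ten constants stored at 64-bit precision.
    gr_bits = 10 * 64

    print(f"dictionary: ~{dictionary_bits / 1e6:.1f} Mbit to specify")
    print(f"GR:         ~{gr_bits} bits to specify")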

Another point made in the preface was that this approach involves empirical science as a core component, as opposed to being primarily about mathematics and algorithm design (as I consider modern AI to be). This may seem confusing, since data compression is usually considered a relatively minor subfield of computer science. The key realization is that lossless data compression can only be achieved through empirical science. This is because compression is impossible for arbitrary inputs: no compressor can achieve an average codelength of less than M bits over the set of all M-bit strings. Lossless compressors work because they contain an implicit assertion about the type of data on which they will be invoked, and they will fail to achieve compression if that assertion turns out to be false. In the case of image compressors like PNG, the assertion is that, in natural images, the values of adjacent pixels are highly correlated. PNG can exploit this structure to achieve compression, and conversely, the fact that PNG achieves compression for a given image means that the image has the assumed structure. In other words, PNG contains an empirical hypothesis about the structure of visual reality, and the fact that it works is empirical evidence in favor of the hypothesis. Now, this pixel-correlation structure of natural images is completely basic and obvious. The proposal, then, is to go further: to develop increasingly sophisticated theories of visual reality and test those theories using the compression principle.
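
The PNG point can be illustrated with a toy experiment. This is a sketch, not how PNG actually works internally: I simply encode each value as its difference from the previous one and hand the residuals to a generic coder (zlib). On a smooth synthetic signal the adjacent-pixel hypothesis holds and the data shrinks; on uniform noise the hypothesis is false and the output is no smaller than the input.

    import random
    import zlib

    def delta_then_deflate(pixels):
        """Encode each value as its difference from the previous one
        (mod 256), then compress the residuals with zlib."""
        residuals = bytes((pixels[i] - pixels[i - 1]) % 256 for i in range(1, len(pixels)))
        return len(zlib.compress(bytes([pixels[0]]) + residuals))

    random.seed(0)
    n = 100_000

    # A "natural-like" signal: adjacent values are highly correlated.
    smooth, x = [], 128
    for _ in range(n):
        x = (x + random.randint(-2, 2)) % 256
        smooth.append(x)

    # Uniform noise: the correlation hypothesis is false here.
    noise = [random.randrange(256) for _ in range(n)]

    print("smooth signal:", delta_then_deflate(smooth), "bytes vs", n, "raw")
    print("pure noise:   ", delta_then_deflate(noise), "bytes vs", n, "raw")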

Still, it may not be obvious why this kind of research would be any different from other work on data compression—people are, after all, constantly publishing new types of compression algorithms. The key difference is the emphasis on large-scale compression; this completely changes the character of the problem. To see why, consider the problem of building a car. If you only want to build a single car, then you just hack it together by hand. You build all the parts using machine tools and then fit them together. The challenge is to minimize the time- and dollar-cost of the manual labor. This is an interesting challenge, but the challenge of building ten million cars is entirely different. If you’re going to build ten million cars, then it makes sense to start by building a factory. This will be a big up-front cost, but it will pay for itself by reducing the marginal cost of each additional car. Analogously, when attempting to compress huge databases, it becomes worthwhile to build sophisticated computational tools into the compressor. And ultimately the development of these advanced tools is the real goal.