Toward Interoperability of Minimal Programs

Assumed background: Kolmogorov complexity and Solomonoff induction.

Suppose I have some data , and I go looking for the models (i.e. programs) which best compress that data. I find two different programs, and , which both reproduce the data using approximately the same number of bits, and that seems to be roughly the best compression possible. On examination, I find that the two models do totally different things internally.

It would be really nice if I could provably construct a third program, , which in some sense “combines the internal structure” of the two programs and , while still achieving approximately the same compression. This would be a result in the general cluster of natural abstraction and interoperable semantics. Very roughly speaking, it would say that if a human and an alien both have approximately-best-compressing models in some domain, but their models have totally alien internal structure, then we can construct a new model which finds both of the original models intelligible, while still achieving basically-optimal compression.

I don’t have a perfect theorem like that with all the kinks worked out. But I can give some math which seems like it would allow a result along those lines, with the right setup.

Some K-Complexity Math

Suppose I have two approximate best compressions of data . Let’s give them just a little internal structure:

  • The first compression consists of a self-contained program for generating an intermediate string , followed by a second self-contained program for generating the data from .

  • Second compression consists of a self-contained program for generating an intermediate string , followed by a second self-contained program for generating the data from .

Quantitatively, using the notation for Kolmogorov complexity (K-complexity), this means:

The existence of our two approximately-best compressions means that both of these approximations must hold.

From there, we do a little math, just using standard properties of K-complexity. (The main properties we use here derive straightforwardly from the Chain Rule. Note that the approximations therefore suppress terms of order .)

implying

Intuitively, this says: since is part of an approximate minimal compression of the data, it has approximately zero K-complexity given the data.

… and together, and given the data could be specified by just appending the two program which generates from the data and the program which generates from the data, both of which are very short (approximately zero, to within terms). So:

Then we chain rule back the other way:

implying

In other words, there exists an approximately-shortest program for the data which consists of a self-contained program to generate and , and then a second self-contained program to generate the data from and . That’s the interesting result.

If the intermediates and are independent of each other, i.e. neither is more compressible given the other, then we can go one more step: . In that case, there exists a shortest program which consists of a self-contained program to generate , another self-contained program to generate , and a third self-contained program to generate the data from and . In that case, our new program can straight-up reuse the and programs from the two original models; the new program directly shares structure with the originals and could even interoperate with them.

Summary

  • We start with two different approximate best compressions of some data, and .

  • Each compression consists of a self-contained program to generate an intermediate string , followed by another self-contained program to generate the data given .

  • Then: there exists another approximate best compression consisting of a self-contained program to generate both and , followed by another self-contained program to generate the data given and .

  • If and are independent (i.e. no nontrivial joint compression), then the new best compression can directly reuse the and generating programs from the two original models.

By itself, this result is kind of toy. Approximate best compressions of lots of real-world data would probably start by defining whole new languages or libraries, which would then be used by component programs later on, so the component programs would not be standalone. On the other hand, there might be other tricks to handle that sort of structure, like e.g. Solomonoff natural latents. The hope would be that we could figure out a few such foundational tricks, and then compose them to handle more complicated programs.

Acknowledgement: David wasn’t involved in this particular post, but it did bubble out of stuff I’ve done with him.