K-complexity is silly; use cross-entropy instead

Short version

The K-complexity of a function is the length of its shortest code. But having many many codes is another way to be simple! Example: gauge symmetries in physics. Correcting for length-weighted code frequency, we get an empirically better simplicity measure: cross-entropy.

Long version

Suppose we have a (Turing-complete) programming language $L$, and a function $f$ of the type that can be named by $L$.

For example, $f$ might be the function that takes (as input) a list of numbers, and sorts it (by producing, as output, another list of numbers, with the property that the output list has the same elements as the input list, but in ascending order). Within the programming language $L$, there will be lots of different programs that represent $f$, such as a whole host of implementations of the bubblesort algorithm, and a whole host of implementations of the quicksort algorithm, and a whole host of implementations of the mergesort algorithm. Note the difference between the notion of the function $f$ (“list sorting”) and the programs that represent it (bubblesort, quicksort, mergesort).

Recall that the Kolmogorov complexity of $f$ in the language $L$ is the length of the shortest program that represents $f$ (writing $L(f)$ for the set of programs in $L$ that represent $f$):

$$K_L(f) := \min_{p \in L(f)} \operatorname{length}(p)$$

This is often touted as a measure of the “complexity” of $f$, to the degree that people familiar with the concept often (colloquially) call a function “simple” precisely to the degree that it has low K-complexity.

I claim that this is a bad definition, and propose the following alternative:

$$\kappa_L(f) := \operatorname{nlog} \sum_{p \in L(f)} \operatorname{rexp}(\operatorname{length}(p))$$

where $\operatorname{nlog}$ denotes logarithm base $\frac{1}{2}$, aka the negative of the (base 2) logarithm, and $\operatorname{rexp}$ denotes exponentiation base $\frac{1}{2}$, aka the reciprocal of the (base 2) exponential.

(Note that we could just as easily use any other base $b$. $\frac{1}{e}$ would be a particularly natural choice, as usual. Here I’m using $\frac{1}{2}$, both because it fits with measuring the lengths of our programs in terms of bits, and because it keeps the numbers whole in our examples.)

Below, I’ll explore this latter definition, and its elegance and theoretical superiority. Then I’ll point out that our own laws of physics seem to have (comparatively) high K-complexity and low alt-complexity, thus giving empirical justification for my “correction”.

Investigation

A first observation is that the alt-complexity and the K-complexity agree whenever there is at most one program in $L$ that represents $f$. If there’s no program, then both equations are (positive) infinite. If there’s exactly one program $p$ representing $f$, then $\operatorname{length}(p)$ will be the only term in the $\min$ and the only term in the $\sum$, so the first definition will yield $\operatorname{length}(p)$ whereas the second definition will yield $\operatorname{nlog}(\operatorname{rexp}(\operatorname{length}(p)))$, but $\operatorname{nlog}$ and $\operatorname{rexp}$ are inverses, so both definitions yield $\operatorname{length}(p)$. Thus, the definitions only differ when $f$ has multiple programs in the language $L$.

In that case, the alt-complexity will be lower than the K-complexity, as you may verify. As a simple example, suppose there are two different programs that represent $f$, both of length 17. Then the K-complexity of $f$ is 17 bits, whereas the alt-complexity of $f$ is

$$\operatorname{nlog}\big(\operatorname{rexp}(17) + \operatorname{rexp}(17)\big) = \operatorname{nlog}\big(2^{-16}\big) = 16 \text{ bits.}$$

According to alt-complexity, having two programs (of the same length) that represent $f$ is just as good as having a single program that’s one bit shorter. By a similar token, having 256 programs that are each $n$ bits long is (according to alt-complexity but not K-complexity) just as good as having a single program that’s $n - 8$ bits long.
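Here is a minimal sketch of both measures, assuming we are simply handed the list of lengths of the programs that represent $f$ (in reality that list is infinite, and the function names below are mine, not standard):

```python
import math

def k_complexity(lengths):
    """K-complexity: the length (in bits) of the shortest program representing f."""
    return min(lengths)

def alt_complexity(lengths):
    """Alt-complexity: nlog of the summed rexp of every program length for f."""
    return -math.log2(sum(2.0 ** -n for n in lengths))

# Two 17-bit programs are worth exactly one 16-bit program:
print(k_complexity([17, 17]), alt_complexity([17, 17]))  # 17 16.0

# 256 programs of length n are worth exactly one program of length n - 8:
n = 40
print(alt_complexity([n] * 256))  # 32.0
```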

Why might this make sense? Well, suppose you’re writing a program that (say) renders a certain 3D scene. You have to make some arbitrary choices to do the rendering, like choosing which point is the 0 point (the center of the room? one of the eight corners?), and how to orient the axes (mathematicians and video game devs have different Y-axis conventions, last I checked), and so on. Should these arbitrary choices count against the complexity of your code? They definitely make your code longer, but it’s not obvious that they make your code “fundamentally more complicated” in the way that extra if statements and for loops do. In attempts to formalize this intuition, we might object: yes, we had to make some arbitrary choices in order to render the scene, but these choices don’t matter: for every possible choice of these conventions, there’s a similarly-short program that renders my scene.

(Now, if you want to talk about the complexity of your code relative to a library that has hard-coded in some particular conventions, then it will be simpler to follow the same conventions, because only in that case do you get to avoid writing the conversions. But if we take your scene-rendering code and the surrounding libraries as a whole package, then the point holds: for every choice of conventions, there’s similarly-simple code that renders the same scene.)
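As a toy illustration of that intuition (the scene format and helper below are made up for this post, not any real rendering library): descriptions that differ only in their choice of origin are the same length, and pick out the same scene once the convention is quotiented out.

```python
# The same toy scene, written down under two different origin conventions.
scene_centered = {"origin": "room_center",
                  "boxes": [(-1.0, 0.0, -1.0), (1.0, 0.0, 1.0)]}
scene_cornered = {"origin": "southwest_corner",
                  "boxes": [(1.0, 0.0, 1.0), (3.0, 0.0, 3.0)]}  # every box shifted by (+2, 0, +2)

def quotient_out_convention(scene):
    """Re-express each box relative to the first box, erasing the origin choice."""
    ref = scene["boxes"][0]
    return [tuple(c - r for c, r in zip(box, ref)) for box in scene["boxes"]]

# Both conventions describe the same convention-free scene, and neither
# description is meaningfully longer than the other.
assert quotient_out_convention(scene_centered) == quotient_out_convention(scene_cornered)
```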

Cross-entropy

A second observation is that the alt-complexity of $f$ in the language $L$ is precisely the cross-entropy of the distribution $L$ relative to the distribution $f$.

“Wait, what?” you protest. “$L$ is a language, and $f$ is a function!”

Yep, but we can promote both $L$ and $f$ to distributions in natural ways, and once we do, we see that this notion of alt-complexity is exactly cross-entropy.

We turn a programming language $L$ into a probability distribution $P_L$ on functions by saying that the probability $P_L(f)$ of a function $f$ is the probability that a randomly-generated program is a representative of $f$:

$$P_L(f) := \sum_{p \in L(f)} \operatorname{rexp}(\operatorname{length}(p))$$

(Note that this equation (probably?) works when you’ve formalized your notion of “programming language” and the corresponding notion of “length” in convenient ways. If you didn’t pick the most convenient definitions, you might need to tweak this equation, but I don’t expect those tweaks to change the story much.)

Equivalently, the probability of a function $f$ in a programming language $L$ is taken to be the probability that Solomonoff induction assigns to $f$ when using the language $L$.

We turn a function $f$ into a probability distribution $\delta_f$ on functions by taking the probability distribution that assigns 100% probability to $f$ (and 0% probability to every other function).

Then the cross-entropy of $L$ relative to $f$ is defined as

$$H(\delta_f, P_L) := \sum_{g} \delta_f(g)\, \operatorname{nlog} P_L(g) = \operatorname{nlog} P_L(f) = \operatorname{nlog} \sum_{p \in L(f)} \operatorname{rexp}(\operatorname{length}(p)),$$

which is just the alt-complexity of $f$ (in the language $L$).

So if our notion of “complexity” takes all the programs for $f$ into account, instead of just the shortest one, then it says that the complexity of a function $f$ is just the cross-entropy of Solomonoff induction when $f$ is the ground truth.

In other words: the alt-complexity of $f$ is the degree of surprise (in bits) that Solomonoff induction (predicting a function, using your chosen programming language) would experience if $f$ were true.

Solomonoff induction

Solomonoff induction also offers a vote in favor of alt-complexity over K-complexity.

I occasionally hear people say that Solomonoff induction concentrates probability-mass on the hypothesis with lowest K-complexity among those hypotheses that fit the data. This is false. Solomonoff induction concentrates probability-mass on the hypothesis with lowest alt-complexity among those hypotheses that fit the data.

Low alt-complexity happens to coincide with low K-complexity pretty often, but whenever it doesn’t, Solomonoff induction prefers alt-simplicity to K-simplicity. To see this, consider the case where a single 4-bit code predicts a 0, and three 5-bit codes predict a 1, and all other codes have been ruled out by the data (or are too long to make any difference to the current calculation). In this case, observe that the probability Solomonoff induction places on 1 exceeds the probability that Solomonoff induction places on 0, precisely because the alt-simplicity of 1 exceeds the alt-simplicity of 0.[1]
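Spelled out numerically (a quick sanity check of that example, pretending these four codes are the only surviving hypotheses):

```python
import math

# One surviving 4-bit code predicts 0; three surviving 5-bit codes predict 1.
mass_0 = 2.0 ** -4        # 1/16 = 2/32 of codespace
mass_1 = 3 * 2.0 ** -5    # 3/32 of codespace

print(mass_1 > mass_0)     # True: Solomonoff induction puts more mass on 1
print(-math.log2(mass_0))  # alt-complexity of "0": 4.0 bits
print(-math.log2(mass_1))  # alt-complexity of "1": ~3.415 bits (lower, i.e. simpler)
```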

Elegance

When I noticed the intuition that you shouldn’t be dinged for making an arbitrary choice-of-convention in your code, if every convention yields a similarly-short program, I had the thought that there should obviously be some way to combine all of the lengths of all of the programs that represent a given function. The “nlog-sum-rexp” formula above is the result of figuring out the most natural-feeling weighting. Like, you can’t just sum together all the lengths, as that would penalize a function for having more programs, which isn’t what we want. So, follow the intuition that two 17-bit programs should be just about as good as one 16-bit program, and then ask how we must be combining lengths. Program-lengths are (canonically) measured in bits, so by exponentiating them we get something that’s intuitively more like the “fraction of codespace” that’s taken up by that program (this intuition is exact if we’re using prefix-free codes). These are non-overlapping, and so we can sum them, and then take the nlogarithm to get back to bits of complexity.
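To make the “fraction of codespace” picture concrete (with a made-up four-word prefix-free code): each codeword claims the set of infinite bitstrings that begin with it, those regions don’t overlap, and each region’s size is exactly the rexp of the codeword’s length, which is what licenses summing them.

```python
from itertools import product

# A toy prefix-free code: no codeword is a prefix of any other.
codewords = ["0", "10", "110", "111"]

# Each codeword's "fraction of codespace" is rexp(length) = 2^-length.
fractions = {w: 2.0 ** -len(w) for w in codewords}
print(fractions)                # {'0': 0.5, '10': 0.25, '110': 0.125, '111': 0.125}
print(sum(fractions.values()))  # 1.0 (Kraft inequality: always <= 1 for prefix-free codes)

# Disjointness check: every 3-bit string extends exactly one codeword,
# so no region of codespace gets counted twice in the sum.
for bits in product("01", repeat=3):
    assert sum("".join(bits).startswith(w) for w in codewords) == 1
```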

As an interesting historical note, it was only after that sequence of thoughts that I noticed that I’d re-invented both Solomonoff induction and (a special case of) cross-entropy.

Of course, having familiarity with both those concepts probably helped me have the above sequence of thoughts as rapidly as I did, but this still felt to me like evidence of elegance: K-complexity is clearly a bit inelegant, and if you fix it in the way that it’s begging to be fixed, you land squarely on other useful battle-hardened concepts.

Empiricism

But perhaps this evidence of elegance is lost on you (as someone who didn’t first work out the obvious correction to K-complexity, and then notice with personal delight that you’d re-invented cross-entropy; or as someone who doesn’t have pre-existing respect for Solomonoff induction). Or perhaps you just delight in more overwhelming evidence.

In that case, I direct your attention to the laws of physics.

As you may have heard, the laws of physics are relativistic. The most naive method of describing a physical configuration involves some arbitrary choices, such as a choice of where the origin is (is it between my eyes, or between yours?), and a choice of orientation, and a choice of velocities. That’s a lot of extra data! Fortunately for us, relativity guarantees that the universe can be described no matter which conventions we choose. The laws of physics are the same no matter what location we declare to be the “origin”, and no matter what trajectory we claim is “at rest”. A naive computer program that simulates classical relativistic physics makes all sorts of arbitrary choices, but for every way of making those choices there’s a way of describing the universe such that the program still works, and so those arbitrary choices don’t really count against us.

Or, at least, they don’t really count against us if we expect our universe to have low alt-complexity. Contentless choices of convention do count against K-complexity!

“Hold on”, you might protest. “In a naive representation of a classical relativistic universe, you have to choose an origin-point. But there are other representations that don’t contain the extra coordinate. For example, instead of saying each particle’s position relative to some hallucinated “origin point”, I can tell you their positions relative to each other, and thus remove the redundancy. Perhaps this sort of thing can always be done when there’s a redundancy, such that physics does in fact have a K-complexity that’s about as low as its alt-complexity.”

That’s a fine hypothesis! Having stated it, you’re presumably prepared to update against it, in the face of contrary evidence. Of which there’s plenty.

For one thing, even if you’re giving the positions of particles relative to other particles (rather than to an origin), you still have a whole host of arbitrary choices to make, such as what order to walk through the particles in when you’re producing all these relative positions.

For another thing, the laws of physics just seem to be really very adamant about the idea that reality can only be specified relative to some arbitrary choices-of-convention, with the property that physics works no matter which arbitrary choice you make, but that you do have to make some choice. If I understand my physics correctly, this is more-or-less one of the core ideas at the heart of gauge theory (though, caveat, my grasp on gauge theory is more tenuous than my grasp on basic classical and quantum mechanics).

Like, very roughly speaking, when you start trying to make quantum mechanics play nice with relativity, the laws of physics glance at the idea that you might need to specify an origin point, scoff, and then ask you to also specify a continuous function from spacetime to the unit circle. As a warm-up. (It’s the U(1) part of the SU(3) × SU(2) × U(1) symmetry, if I understand correctly.)

And—if I understand correctly, and again noting that my grasp on gauge theory is somewhat tenuous—you can’t get away from picking some gauge, and this fact has real physical effects, such as photons.[2]

If I’m understanding it correctly, physics really goes ham on the idea that you’ve gotta make a whole lot of arbitrary choices before you can start describing a physical system at all. These choices wash out (in the sense that you can describe your system similarly easily ~no matter which choice you make), which is why I call them arbitrary, but it sure does look like physics itself is giving a strong vote in favor of alt-complexity over K-complexity.

In other words: insofar as you buy Occam’s razor, which says that reality itself is supposed to turn out to be ‘simple’, reality itself gets some say in what counts as ‘simple’. And it sure looks to me like reality has a lot more alt-simplicity than it has K-simplicity. And so K-complexity is just not a very good measure of the “simplicity” that actual physics actually possesses.


ETA: Various folk in the comments have pointed out that the K-complexity and the alt-complexity differ by an amount bounded by a constant $c$ (that depends on $L$ but not on $f$). This is cool, and a stronger relationship than I had known, but (to state the obvious) it doesn’t much undermine the point that our laws of physics seem to prefer alt-complexity to K-complexity.

For instance, suppose (very generously) that the constant $c$ is 10 bits. This would mean that no reality can get more than 1024 times as much probability-mass concentrated in an enormous cluster of long programs (such as “our laws of physics, with an arbitrary gauge hard-coded”) as in some single shorter program (perhaps one that just iterates through all gauges, performing some kind of correctness check).
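Spelling out the arithmetic behind that (my reading of the commenters’ bound, in the notation from above):

$$K_L(f) - \kappa_L(f) \le c \quad\Longrightarrow\quad \operatorname{rexp}(\kappa_L(f)) \le 2^{c} \cdot \operatorname{rexp}(K_L(f)),$$

so with $c = 10$ the total probability-mass on $f$ (summed over all of its programs) is at most $2^{10} = 1024$ times the mass of its single shortest program.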

That’s interesting, but it leaves open an empirical question of whether the actual world we find ourselves in is in fact one with most of its support coming from a single short program. Like, at least 1/1025th of reality’s support must come from a single short program, but some realities will have 99% of their support coming from a single short program, whereas others will flirt with the ~0.1% boundary. Which sort of reality are we dealing with?

When we look around, we see a reality that seems to be better-described by an enormous regular cluster of long programs, by a hefty factor (perhaps 512, in this example).

We could have seen reality be almost as K-simple as it is alt-simple, but in fact it looks to be significantly more K-complex than it is alt-complex. That seems to me like empirical evidence validating the theory (and intuition) that we should think in terms of program-clusters rather than in terms of shortest-programs.

Or in other words: sure, the K-complexity and the alt-complexity of the universe differ by at most an additive constant that depends only on $L$ and not on the universe, but “you will get no more than $c$ bits of empirical evidence favoring alt-complexity over K-complexity” leaves open the possibility of getting quite a few bits, and that looks to me like what happened.


Conjectures

I’ve long felt that algorithmic information theory is kinda janky and annoying. I conjecture that this is because it’s using K-complexity instead of alt-complexity as one of its central notions.

For instance, recall that Solomonoff induction doesn’t converge on the hypothesis with lowest K-complexity. It converges on the hypothesis with lowest alt-complexity instead. If you want to prove something like “Solomonoff induction converges on the hypothesis with lowest K-complexity”, you’ll have to construct some sort of awkward repeated game that’s somehow driving a wedge between the shortest program and all the other competing programs, and then argue something about how the difference eventually gets arbitrarily small, or something. Which is tedious and kinda insane, compared to the theorem saying that Solomonoff induction converges on the hypothesis with lowest alt-complexity, which is practically by-definition.

I haven’t looked much at algorithmic information theory texts since developing the idea of alt-complexity, but if my vague memories serve, lots of the theorems in algorithmic information theory felt to me like they were tedious and kinda insane in this way. I think there’s a decent chance that a variety of theorems in algorithmic information theory could be cleaned up by replacing K-complexity with alt-complexity.

(This should at least be true of theorems relating “algorithmic entropy” to the notion of entropy used in other parts of information theory and in physics. Alt-complexity has a very simple and clean relationship to entropy, as noted above; the relationship between K-complexity and entropy is much more fraught, and probably requires some repetition-based wedge-driving crap.)

EDIT: see interstice’s comment here; K-complexity and alt-complexity differ by at most a constant (that depends only on $L$ and not on $f$), so the conversions can’t be that bad, and I now doubt that the difference accounts for as much annoyance as I originally conjectured. (Though it still seems plausible to me that various theorems are obscured by the use of K-complexity instead of alt-complexity.)

Acknowledgements

I have been using a vague/implicit concept of alt-complexity in my own notes for a number of years, but didn’t develop the idea explicitly until a recent conversation with Adam Scherlis right here on LessWrong a couple months ago. (You’ll have to click ‘see in context’ to see the whole comment thread.)

As an aside, I recommend reading that comment thread if you’re interested in watching two people with different intuitions go back-and-forth until they successfully transmit their points to each other, and come away with more understanding (and cooler ideas) than either went in with.

(The primary thing we figured out together was not so much alt-complexity, as what “entropy” should mean and how it relates to complexity, but I also refined the concept of alt-complexity in that interaction. Thanks Adam!)

Takeaways

I have personally abandoned the concept of K-complexity, in favor of alt-complexity. I recommend it.

In my own notes, I call this new concept “complexity”. Sometimes I call it “$L$-complexity”, where $L$ is the programming language, when I want to make that dependence explicit. If “complexity” isn’t specific enough to be unambiguous, I encourage commenters to brainstorm alternative names, before “alt-complexity” sticks.


  1. ↩︎

    K-complexity performs better when the programming language is a language of probabilistic programs, rather than deterministic ones. That said, this is basically just a way of bringing K-complexity closer to the correct notion of alt-complexity, and the pathologies discussed above can still be exhibited in the case of probabilistic programs.

    Furthermore, if you’re working with alt-complexity instead of K-complexity, there’s less need to work with probabilistic programs. You can just work with collections of deterministic programs instead. These views can be brought closer together by thinking of computer programs not as finite codes, but as infinite bitstrings, which we think of as a prefix-free code for a finite program plus a random seed. But (a) that would be a digression from my main point, and also (b) I’m still a bit confused about something to do with representations of programs in Solomonoff induction, so that’s all I’ll say on the topic here.

  2. ↩︎

    Here’s my current understanding of the story, as I shall now briefly attempt to render into (local) layperson’s terms. I enthusiastically solicit corrections from people who understand the theory well enough to critique me.

    Quantum mechanics tells us that the nature of reality is a (certain sort of well-behaved) function from possible-configurations-of-the-entire-universe to complex numbers (called the “amplitude” of the configuration).

    The laws of physics are phase-invariant, in the sense that if we rotate all of those complex numbers by the same fixed angle, then this doesn’t change anything.

    This picture does a great job at predicting certain atomic processes, but it’s not very compatible with relativity. Because, like, once you’ve thrown simultaneity out the window, concepts like “ways that the universe can be configured across all of space in a single instant” are looking pretty shaky.

For some reason I haven’t understood yet, some madlad physicist was like “ok, but what if we postulate some sort of superpowered version of phase invariance, where the angle that we’re rotating each of the complex numbers by depends on where things are in space.” By doing something like(?) choosing a function from spacetime to the unit circle U(1), and taking each configuration of $n$ particles, and multiplying its amplitude by the different points on U(1) given by the map at the different particle positions (according to this configuration)? I think? So far I’ve mostly stared at simple versions of the equation with 1-particle systems, and haven’t managed to understand the texts about the more general case here, and it’s also plausible to me that there’s $n$ different functions from spacetime to U(1), or that I’ve totally misunderstood what I’m doing. Clarification is welcome. But the point is, we can hypothesize some sort of function(s) to U(1), and we can somehow imagine perturbing the wave function not by a uniform phase-change, but by a spacetime-dependent phase change.

    And you might hope that the laws of physics would be invariant under these weird spacetime-dependent phase changes.

But that’s also kinda ridiculous, right? Like, it’s one thing to have laws of physics that don’t depend on whether you think the center of the world is located in New York or Chicago or LA. But it’s another thing entirely to say that I’m allowed to give you an arbitrary topological map of the universe where I get to make up all the distances according to my whims; surely the laws of physics shouldn’t be invariant to that. Like, if I arbitrarily declare that the distance from New York to Chicago is a million distance-units, and the distance from Chicago to Los Angeles is sixteen distance-units, then the laws of physics must say that a package travelling down a frictionless vacuum-tube from New York to Los Angeles must experience a dramatic spontaneous slowdown sometime after crossing Chicago, for no apparent reason, contra normal laws like “momentum is conserved”. You can’t expect physics to be invariant if you start smacking its parameters differently in different places, right? That should have visible effects, in regions where your arbitrary choices change!

    And, like, if you actually take a perfectly good quantum system of electrons and smack it with a (non-constant) map (from spacetime) to U(1), then the electrons will indeed do some jerking-around, when they move through regions where the map changes. So we can tell the difference between perturbing phases in this way, and not perturbing phases in this way.

    Or, at least, we would be able to tell the difference, so long as there wasn’t some other sort of particles that reacted to the map in a sort of “equal and opposite” way. Such that, like, you think the map to U(1) is constant but there’re some other particles floating around that bop the electrons causing them to jerk, and I think that the map to U(1) is non-constant and there are no such particles.

    And so this theory can be empirically tested! Quick, look around: do you notice electrons jerking around? Like, are the electrons in your photoreceptors jiggling such that you can see anything at all?

Because that second field of particles totally turns out to exist! It’s the photon field! And it’s coupled to the electron field in just such a way that the laws of physics are invariant under spacetime-dependent phase-changes; everyone agrees about what happens, but different observers (with different maps to U(1)) disagree about which electron-jerks were caused by photons!

    The laws of physics are apparently so adamant about being invariant under a spacetime-dependent change-of-phase that reality would rather bolt on a whole new breed of particle (whose presence in any particular case different people can disagree about) than violate this wacky invariance.

    (From which we learn something about the language in which physics is itself simple: it is a language in which particle-fields are cheaper than symmetry-violations, even for symmetries that (so far) seem pretty wacky to me.)

    What does this mean, for our purposes? Well, if I understand correctly, there is not (in general) a way to choose the map to U(1) such that there seem to be no photons. (You can probably choose a map to U(1) such that any given photon looks ‘fictitious’, but you can’t in general make them all disappear simultaneously.) My guess is that there’s also not a canonical way to choose “the” constant map to U(1), similar to how you can’t just choose “the” rest frame in relativity: somebody’s gotta pick out the basis vectors, and picking them out involves choosing quite a lot of data.

    I again caveat that my grasp on gauge theory is rather tenuous, and I solicit better explanations from people who have done more than just staring at the quantum electrodynamic Dirac Lagrangian for a few hours.