The Coding Theorem — A Link between Complexity and Probability
This post has been written in collaboration with Iliad in service of one of Iliad’s longer-term goals of understanding the simplicity bias of learning machines.
In this post, I give a self-contained treatment, including a proof, of the coding theorem.[1] Let $U$ be a universal prefix-free Turing machine, let $K(x)$ denote the prefix-free Kolmogorov complexity of a binary string $x$, and let $m(x) := \sum_{p \,:\, U(p) = x} 2^{-\ell(p)}$ be its algorithmic probability, writing $\ell(p)$ for the length of $p$. The coding theorem states that $K(x) = -\log_2 m(x) + O(1)$: the shortest description of $x$ is, up to an additive constant, as long as the negative log of the total probability that a random program outputs $x$.[2]
This statement has found some prior interest on LessWrong. Alexander Gietelink Oldenziel called the argument behind one inequality the padding argument, after it was independently rediscovered by Matthias Dellago and Lucius Bushnaq.[3] Nate Soares wrote a whole post about his rediscovery of some of these ideas, for the case that strings can be produced by many different programs.
The motivation of this post is to provide a didactic and elementary proof for readers, avoiding the typical machinery of “lower semicomputable discrete semimeasures”, and to have a reference for future posts. In particular, Alexander and I are currently writing a post on the complexity of functions learned by neural networks, in which we would like to refer to this result.
Audience: This post assumes basic knowledge of Turing machines, the Church-Turing thesis, and the Kolmogorov complexity for prefix-free Turing machines (often called prefix Turing machines; I find it clearer to append “free”). We will briefly recall some basic definitions as a reminder.
Contributions: It is possible that all parts of the proofs can already be found identically in the literature, but I did not try to find them. OpenAI’s o3 found the idea for the dovetailing procedure. The proof of the efficient algorithmic Kraft coding in the appendix is mine. The entire post is written by myself, except the last paragraph of the following section, which was first drafted by GPT-5.
Acknowledgments: I thank Alexander Gietelink Oldenziel and Matthias Dellago for engaging discussions about the coding theorem and the padding argument.
Formulating the Coding Theorem
Let $U$ be a universal prefix-free Turing machine. “Prefix-free” means that the set of programs on which $U$ halts is prefix-free: no halting program is a proper prefix of another halting program. “Universal” means that $U$ can simulate any prefix-free Turing machine $T$, at the cost of prepending a constant-size encoding of $T$ to the program.[4] For a binary string $x$, the (prefix-free) Kolmogorov complexity is
$$K(x) := \min\{\ell(p) : U(p) = x\},$$
the length of a shortest program that makes $U$ print $x$.

We construct the algorithmic probability of $x$ by summing over all programs that produce $x$:
$$m(x) := \sum_{p \,:\, U(p) = x} 2^{-\ell(p)}.$$
Since the halting programs form a prefix-free set, Kraft's inequality gives $\sum_x m(x) \le 1$ for the total probability.

This total probability is also called the halting probability; it is the probability that a randomly selected program halts — formally, the probability that $U$ halts when its input bits are supplied one by one by independent fair coin flips.

For a program or binary string $p$, we write $p\{0,1\}^\infty$ for the set of infinite binary sequences starting with $p$; under the coin-flip measure, this set has probability exactly $2^{-\ell(p)}$.
Lemma 1 (Deadcode/padding argument). We have $2^{-K(x)} \le m(x)$. More precisely, $m(x)$ is the probability that $U$ outputs $x$ when its input bits are sampled by independent fair coin flips.

Proof. We first claim that the set of infinite sequences on which $U$ outputs $x$ (after reading a finite prefix) equals the disjoint union $\bigsqcup_{p \,:\, U(p) = x} p\{0,1\}^\infty$.

The inclusion from left to right is clear: Every infinite sequence on which $U$ halts with output $x$ does so after reading some finite prefix $p$ with $U(p) = x$, and thus lies in $p\{0,1\}^\infty$. Conversely, on any sequence starting with such a $p$, the machine halts with output $x$ after reading exactly the bits of $p$; the remaining bits are never read and act as “dead code”. The union is disjoint because the halting programs are prefix-free.

Thus, we obtain
$$\mathbb{P}(U \text{ outputs } x) \;=\; \sum_{p \,:\, U(p) = x} \mathbb{P}\big(p\{0,1\}^\infty\big) \;=\; \sum_{p \,:\, U(p) = x} 2^{-\ell(p)} \;=\; m(x).$$

In the last step, we used that the likelihood of sampling an infinite sequence that starts with a fixed string $p$ is exactly $2^{-\ell(p)}$. The inequality $2^{-K(x)} \le m(x)$ follows since a shortest program for $x$ alone contributes $2^{-K(x)}$ to the sum.
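To make the probability interpretation concrete, here is a minimal Python sketch. The toy “machine” is just a hypothetical prefix-free lookup table from halting programs to outputs (not universal, and not from the post); the defining sum and a Monte Carlo estimate over coin-flip inputs should agree:

```python
import random

# Hypothetical toy machine: a prefix-free set of halting programs -> outputs.
TOY_MACHINE = {"0": "x", "10": "y", "110": "x", "111": "z"}

def m(x):
    """Algorithmic probability: sum of 2^-l(p) over programs p printing x."""
    return sum(2.0 ** -len(p) for p, out in TOY_MACHINE.items() if out == x)

def run_on_coin_flips(max_bits=64):
    """Feed fair coin flips to the machine bit by bit; by prefix-freeness,
    at most one prefix of the stream is a halting program."""
    program = ""
    for _ in range(max_bits):
        program += random.choice("01")
        if program in TOY_MACHINE:
            return TOY_MACHINE[program]
    return None  # no halting program found (cannot happen here: Omega = 1)

# m("x") = 2^-1 + 2^-3 = 0.625; the empirical frequency should be close.
samples = [run_on_coin_flips() for _ in range(100_000)]
print(m("x"), sum(out == "x" for out in samples) / len(samples))
```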
Now, let us state the main result of this post.

Theorem 2 (Coding Theorem). There is a constant $c \ge 0$, independent of $x$, such that for all binary strings $x$:
$$2^{-K(x)} \;\le\; m(x) \;\le\; 2^{-K(x) + c}.$$

In other words, up to at most a multiplicative constant, $m(x)$ and $2^{-K(x)}$ agree; taking logarithms, $K(x) = -\log_2 m(x) + O(1)$.

The rest of the post will be about proving the second inequality, which is much harder than the first. It roughly states that if $x$ has high algorithmic probability — many programs, or short programs, output it — then $x$ in fact has a single short program.

The rough outline is as follows: Our proof of this inequality constructs, for each string $x$ with $m(x) > 0$, a codeword of length roughly $-\log_2 m(x)$, in such a way that the codewords form a prefix-free code from which $x$ can be algorithmically decoded. Any such code gives an upper bound on $K(x)$, up to the constant cost of the decoder.
Algorithmic Kraft Coding
We need an algorithmic version of the Kraft inequality as a tool. If you are not interested in the details, feel free to only read the statement of Theorem 4 and then move on to the next section.
For a binary string $x = x_1 \dots x_n$, write $0.x := \sum_{i=1}^{n} x_i 2^{-i} \in [0,1)$ and define the dyadic interval
$$I(x) := \big[\,0.x,\; 0.x + 2^{-\ell(x)}\,\big) \subseteq [0,1).$$

Lemma 3. Let $x, y$ be binary strings. Then $x$ is a prefix of $y$ if and only if $I(y) \subseteq I(x)$.

Proof. Assume $x$ is a prefix of $y$, i.e., $y = xz$ for some binary string $z$. Then $0.y = 0.x + 2^{-\ell(x)} \cdot 0.z$, and hence
$$0.x \;\le\; 0.y \;\le\; 0.y + 2^{-\ell(y)} \;=\; 0.x + 2^{-\ell(x)}\big(0.z + 2^{-\ell(z)}\big) \;\le\; 0.x + 2^{-\ell(x)}.$$

All inequalities except the last are trivial. The last inequality follows since adding $2^{-\ell(z)}$ to $0.z$ gives at most $1$. Thus $I(y) \subseteq I(x)$.

For the other direction, assume we have $I(y) \subseteq I(x)$. Then $\ell(y) \ge \ell(x)$, since $I(y)$ cannot be longer than $I(x)$. If $x$ were not a prefix of $y$, then $x$ and the truncation $y' := y_1 \dots y_{\ell(x)}$ would be distinct strings of the same length, whose dyadic intervals are disjoint; but $I(y) \subseteq I(y')$ by the first direction, contradicting $\emptyset \ne I(y) \subseteq I(x)$.
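For readers who like such statements machine-checked, here is a small Python sketch (the helper names are mine, not the post's) that verifies Lemma 3 exhaustively on all pairs of strings up to length 6:

```python
from fractions import Fraction
from itertools import product

def dyadic_interval(w):
    """I(w) = [0.w, 0.w + 2^-len(w)) as a pair of exact fractions."""
    left = Fraction(int(w, 2), 2 ** len(w)) if w else Fraction(0)
    return left, left + Fraction(1, 2 ** len(w))

def contains(x, y):
    """Does I(x) contain I(y)?"""
    ax, bx = dyadic_interval(x)
    ay, by = dyadic_interval(y)
    return ax <= ay and by <= bx

# Lemma 3: x is a prefix of y  <=>  I(y) is contained in I(x).
strings = ["".join(b) for n in range(7) for b in product("01", repeat=n)]
assert all(y.startswith(x) == contains(x, y) for x in strings for y in strings)
print("Lemma 3 verified on all pairs of strings up to length 6")
```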
We are now ready to state the following theorem. I was told it is essentially equivalent to Theorem 3.6.1 in the book “Algorithmic Randomness and Complexity” by Downey and Hirschfeldt, which is therein called the KC theorem and attributed to Levin.
Theorem 4 (Algorithmic Kraft Coding). Let $\ell_1, \ell_2, \dots$ be a finite or infinite computable sequence of natural numbers satisfying the strict Kraft inequality
$$\sum_i 2^{-\ell_i} < 1.$$

Then there exist codewords $c_1, c_2, \dots \in \{0,1\}^*$ with the following two properties:

1. $c_1, c_2, \dots$ forms a prefix-free code, i.e., no $c_i$ is the prefix of any $c_j$ with $i \ne j$.
2. The codelengths are given by $\ell(c_i) = \ell_i$.

Furthermore, there is an algorithm which successively maps the lengths to the codewords: it outputs $c_i$ right after having read $\ell_1, \dots, \ell_i$, without knowing the lengths that come later.
Proof. Here, we present a simple proof that only achieves slightly less efficient codelengths $\ell(c_i) = \ell_i + 1$; this loss is harmless for our purposes, since it only changes the constant in Theorem 2. The full statement with codelengths exactly $\ell_i$ is proved in the appendix.

The algorithm proceeds as follows: Upon receiving the sequence $\ell_1, \ell_2, \dots$, it maintains a pointer $a \in [0,1)$, initialized as $a := 0$, marking the left endpoint of the still-unused part of the unit interval.

To construct the codeword $c_i$, round $a$ up to the next multiple of $2^{-(\ell_i + 1)}$, say $b$, and let $c_i$ consist of the first $\ell_i + 1$ bits of the binary expansion of $b$, so that $I(c_i) = [\,b, b + 2^{-(\ell_i+1)}\,)$. Then advance the pointer to $a := b + 2^{-(\ell_i + 1)}$.

Now we prove the first property (the second one holds by construction, though with lengths $\ell_i + 1$ instead of $\ell_i$). In step $i$, the rounding skips less than $2^{-(\ell_i+1)}$ of the interval, and the codeword itself uses exactly $2^{-(\ell_i+1)}$; hence the pointer always satisfies $a \le \sum_{j \le i} 2 \cdot 2^{-(\ell_j+1)} = \sum_{j \le i} 2^{-\ell_j} \le 1$, so we never run out of space.

Consequently, the dyadic interval satisfies $I(c_i) \subseteq [0,1)$, and the intervals $I(c_1), I(c_2), \dots$ are pairwise disjoint, since each lies entirely to the right of all previous ones. By Lemma 3, no $c_i$ is then a prefix of any $c_j$, since a prefix relation would force one interval to contain the other. This proves the first property.
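Here is a runnable sketch of this pointer-and-rounding procedure, using exact arithmetic; the function name and bookkeeping are my own rendering of the proof above:

```python
import math
from fractions import Fraction

def kraft_coder(lengths):
    """For requested lengths l_1, l_2, ... with sum(2**-l_i) <= 1, yield
    prefix-free codewords of length l_i + 1, as in the simple proof."""
    a = Fraction(0)  # left endpoint of the unused part of [0, 1)
    for l in lengths:
        step = Fraction(1, 2 ** (l + 1))
        k = math.ceil(a / step)            # round a up to a multiple of 2^-(l+1)
        assert (k + 1) * step <= 1, "lengths violate Kraft's inequality"
        yield format(k, "b").zfill(l + 1)  # the l+1 bits of the left endpoint b
        a = (k + 1) * step                 # consume the dyadic interval I(c_i)

print(list(kraft_coder([1, 2, 3, 4])))  # ['00', '010', '0110', '01110']
```

The extra bit per codeword is the price of the rounding; the appendix shows how to avoid it.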
Remark: The statement of the theorem is stronger than many formulations of Kraft's inequality, since I require there to be an algorithm that selects one codeword at a time. This is what makes my proof of the efficient codewords in the appendix so complicated.
The reverse of this theorem is also true, i.e., the codelengths of any prefix-free code satisfy Kraft’s inequality, see the Wikipedia page.
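As a quick numerical sanity check of this converse direction, one can compute the Kraft sum of any prefix-free code; a tiny sketch (not from the post):

```python
def kraft_sum(code):
    """Kraft sum of a code; for a prefix-free code this is at most 1."""
    assert all(not b.startswith(a) for a in code for b in code if a != b)
    return sum(2.0 ** -len(c) for c in code)

print(kraft_sum(["0", "10", "110", "111"]))  # 1.0
```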
A Proof of the Coding Theorem
We now prove the second inequality of Theorem 2. Recall that we want to prove that there is a constant $c$ with
$$m(x) \;\le\; 2^{-K(x)+c} \quad \text{for all } x, \qquad \text{equivalently} \qquad K(x) \;\le\; -\log_2 m(x) + c.$$

We now explain how to code binary strings $x$ by short codewords: each $x$ with $m(x) > 0$ will receive a codeword of length roughly $-\log_2 m(x)$, via the algorithmic Kraft coding from Theorem 4.

We proceed as follows: For all pairs $(p, t)$ of a program $p$ and a step count $t$, we run $U$ on input $p$ for $t$ steps and record whether it halts after exactly $t$ steps. We dovetail these infinitely many computations, e.g., by processing in stage $n = 1, 2, \dots$ all pairs with $\ell(p) + t = n$; this way, every halting computation is discovered at some point, and the discovered programs yield an ever-growing lower bound on $m(x)$ for each $x$.

When this happens — i.e., when a new halting program for some $x := U(p)$ is found — check whether the smallest integer $k$ with $2^{-k} < (\text{current lower bound on } m(x))$ has decreased or become defined for the first time. For every such new value of $k$, we request a codeword of length $k + 1$ for $x$ from the algorithmic Kraft coding.

For this to be well-defined, we need to check that all the codelengths requested over the course of this procedure satisfy Kraft's inequality. For fixed $x$, the requested lengths are $k + 1$ for pairwise distinct integers $k \ge k_x$, where $k_x$ denotes the smallest integer with $2^{-k_x} < m(x)$; their total weight is thus at most
$$\sum_{k \ge k_x} 2^{-(k+1)} \;=\; 2^{-k_x} \;<\; m(x).$$
Summing over all $x$, the total weight of all requests is strictly less than $\sum_x m(x) \le 1$.

Thus, algorithmic Kraft coding can indeed be applied.
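In code, the bookkeeping of this dovetailing procedure might look as follows. The oracle `step(p, t)` — returning $U(p)$ if $U$ halts on $p$ after exactly $t$ steps, and `None` otherwise — is hypothetical, and floats stand in for exact dyadic arithmetic:

```python
import math
from itertools import count, product

def requests(step):
    """Dovetail all pairs (program p, step count t). Whenever the running
    lower bound on m(x) first undercuts a new threshold 2**-k, emit a
    request (x, k + 1) for a codeword of length k + 1 for x."""
    lower = {}       # running lower bound on m(x)
    smallest_k = {}  # per x: smallest k with 2**-k < lower[x] so far
    for n in count(1):                   # stage n: all pairs with len(p) + t == n
        for t in range(1, n + 1):
            for bits in product("01", repeat=n - t):
                x = step("".join(bits), t)
                if x is None:
                    continue
                lower[x] = lower.get(x, 0.0) + 2.0 ** (t - n)  # add 2**-len(p)
                new = math.floor(-math.log2(lower[x])) + 1     # least k with 2**-k < lower[x]
                old = smallest_k.get(x)
                for k in range(new, old if old is not None else new + 1):
                    yield x, k + 1       # request a codeword of length k + 1
                smallest_k[x] = new
```

Feeding the emitted lengths into a Kraft coder such as the sketch after Theorem 4 produces the actual codewords.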
Now, if you have a codeword $c$ that was requested for some string $x$, you can algorithmically decode $x$ from $c$: rerun the entire procedure — the dovetailing together with the algorithmic Kraft coding — until $c$ appears as a codeword, and output the string $x$ for which it was requested. This decoder is itself a prefix-free Turing machine, so by the universality of $U$ we obtain
$$K(x) \;\le\; \ell(c) + c_U,$$
where $c_U$ is the constant length of the prefix that makes $U$ simulate this decoder.

Now, note that for a given $x$ with $m(x) > 0$, the lower bound converges to $m(x)$, so eventually a codeword of the minimal length $k_x + 1$ is requested for $x$. Since $2^{-(k_x - 1)} \ge m(x)$ by minimality of $k_x$, we have $k_x + 1 \le -\log_2 m(x) + 2$.

Combining with the previous equation, we obtain
$$K(x) \;\le\; k_x + 1 + c_U \;\le\; -\log_2 m(x) + 2 + c_U,$$
i.e., $m(x) \le 2^{-K(x)+c}$ with $c := 2 + c_U$,

finishing the proof of Theorem 2.
Appendix: Proof of Efficient Algorithmic Kraft Coding
Here, we present a more complicated algorithm that achieves the efficient codelengths $\ell(c_i) = \ell_i$ claimed in Theorem 4.

We now explain the details of how to do this. Initialize the set of “found codewords” as $C_0 := \emptyset$; it collects the codewords emitted so far, $C_i = \{c_1, \dots, c_i\}$. In addition, the algorithm maintains a resource $F_i$ of free codewords, whose dyadic intervals cover the still-unassigned part of $[0,1)$. We maintain the following four properties for all $i$:

1. $C_i \cup F_i$ is prefix-free.
2. $C_i = \{c_1, \dots, c_i\}$, and the codelengths are given by $\ell(c_j) = \ell_j$ for $j = 1, \dots, i$.
3. No codelength is represented more than once in $F_i$.
4. $\sum_{w \in F_i} 2^{-\ell(w)} = 1 - \sum_{j=1}^{i} 2^{-\ell_j}$.

If we can maintain these properties indefinitely, the theorem follows: properties 1 and 2 are exactly the two claimed properties of the code.

In the course of this proof, we will progressively replace elements of $F_{i-1}$ by longer leftover pieces to obtain $F_i$, while emitting the next codeword $c_i$.

Let $F_0 := \{0, 10, 110, 1110, \dots\}$ be the prefix-free set containing exactly one codeword of every positive length.[5]

Note that all four properties still hold for $i = 0$: there are no codewords yet, $F_0$ is prefix-free, every codelength occurs at most once in it, and its total weight is $\sum_{k=1}^{\infty} 2^{-k} = 1$.
Now, assume by induction that all four properties hold for $i - 1$, and let $\ell_i$ be the next incoming length. If $F_{i-1}$ contains an element $w$ with $\ell(w) = \ell_i$, we simply set $c_i := w$ and $F_i := F_{i-1} \setminus \{w\}$; all four properties are evidently preserved. Otherwise, let $w$ be the element of maximal length in $F_{i-1}$ with $\ell(w) < \ell_i$ — provided such an element exists.

Finally, consider the case that there is no $w \in F_{i-1}$ with $\ell(w) \le \ell_i$ at all. Since no codelength is represented more than once in $F_{i-1}$, its total weight would then be at most $\sum_{k > \ell_i} 2^{-k} = 2^{-\ell_i}$.

This would result in
$$\sum_{j=1}^{i} 2^{-\ell_j} \;=\; \Big(1 - \sum_{w \in F_{i-1}} 2^{-\ell(w)}\Big) + 2^{-\ell_i} \;\ge\; 1,$$

in contradiction to the assumed Kraft's inequality $\sum_j 2^{-\ell_j} < 1$. Thus, a suitable $w$ with $\ell(w) < \ell_i$ exists in the remaining case.

Now, set
$$c_i := w\,0^{\,\ell_i - \ell(w)},$$
i.e., $w$ padded with zeros up to length $\ell_i$.

Define
$$F_i := \big(F_{i-1} \setminus \{w\}\big) \;\cup\; \big\{\,w\,0^{j}\,1 \;:\; 0 \le j \le \ell_i - \ell(w) - 1\,\big\}.$$
The new elements have the pairwise distinct lengths $\ell(w) + 1, \dots, \ell_i$, and together with $c_i$ they partition the dyadic interval $I(w)$.
We now show that all four properties remain true with these constructions. The newly added elements in $F_i$, together with $c_i$, all extend $w$, and they are pairwise prefix-incomparable, since any two of them differ right after their common run of zeros following $w$. Any old element $u \in (F_{i-1} \setminus \{w\}) \cup C_{i-1}$ is prefix-incomparable with $w$ by induction, and hence also with every extension of $w$: a prefix relation between $u$ and an extension of $w$ would force one between $u$ and $w$. This proves property 1; property 2 holds by construction, as $\ell(c_i) = \ell_i$.

For property 3, recall that $w$ was chosen of maximal length among the elements of $F_{i-1}$ with length less than $\ell_i$, and that no element of length exactly $\ell_i$ existed in this case. Hence $F_{i-1} \setminus \{w\}$ contains no codeword with a length in $\{\ell(w)+1, \dots, \ell_i\}$ — but these are precisely the lengths of the newly added elements, each occurring exactly once.

Thus, each codelength is represented at most once in $F_i$, proving property 3.

Property 4 follows from the geometric sum $\sum_{k=\ell(w)+1}^{\ell_i} 2^{-k} = 2^{-\ell(w)} - 2^{-\ell_i}$: the weight $2^{-\ell(w)}$ removed with $w$ is exactly the weight of the new elements plus the weight $2^{-\ell_i}$ spent on the codeword $c_i$.

As we explained before, this concludes the proof.
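For concreteness, here is a compact Python sketch of this resource-splitting construction. I initialize the resource with the empty string — the variant suggested in footnote 5, which keeps the resource finite — rather than the choice made above; otherwise it follows this appendix, with the case $\ell(w) = \ell_i$ handled by the same splitting formula:

```python
def efficient_kraft_coder(lengths):
    """For lengths l_1, l_2, ... with sum(2**-l_i) <= 1, yield prefix-free
    codewords with len(c_i) == l_i, by splitting free dyadic intervals."""
    F = {""}  # resource of free codewords; lengths stay pairwise distinct
    for l in lengths:
        fits = [w for w in F if len(w) <= l]
        if not fits:  # cannot happen while Kraft's inequality holds
            raise ValueError("lengths violate Kraft's inequality")
        w = max(fits, key=len)  # longest free codeword that still fits
        F.remove(w)
        # Keep the siblings w1, w01, ... (lengths len(w)+1, ..., l) as the
        # new resource elements; emit w padded with zeros to length l.
        F.update(w + "0" * j + "1" for j in range(l - len(w)))
        yield w + "0" * (l - len(w))

print(list(efficient_kraft_coder([2, 1, 3, 3])))  # ['00', '1', '010', '011']
```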
- ^
See Theorem 4.3.3 in Li and Vitányi’s “An Introduction to Kolmogorov Complexity and Its Applications”. This was first proved by L.A. Levin in 1974.
- ^
Importantly, this result is only correct up to a logarithmic error for plain, or descriptional, Kolmogorov complexity. Thus, the assumption that we work with prefix-free Turing machines is crucial. Thanks to Daniel Filan for pointing that out.
- ^
The term “padding” is also used in this context in the book “An Introduction to Universal Artificial Intelligence” by Hutter, Quarel, and Catt.
- ^
In particular, $U$ halts on the correspondingly encoded program whenever the simulated machine halts on the original program.

- ^
My guess is one could also start with the resource $F_0$ consisting of only the empty binary string, which would possibly be somewhat cleaner than the choice I made.
Excellent! Great to have a cleanly formulated article to point people to!
In case anyone else didn’t know what it meant for a set of binary strings to be “prefix-free”, here’s Claude’s explanation, which I found helpful:
Thanks for adding!
Not sure if you were aware, but in the glossary at the top-right of the post, there is also an explanation (albeit shorter) of “prefix-free code”. I’m just mentioning this in case you weren’t aware of the glossary functionality.
Ah I hadn’t noticed that, very nice. Great post!
Should be 1?
What is the context? The number 0 appears a bunch of times.
Ah sorry, I was using LW's feature of highlighting something in the text and then making a comment about it. This is the part that says:
I see! Yes, edited.
It wasn’t wrong before the change, but is clearly more meaningful. Thanks!