A hundredth of a bit of extra entropy

There are two ways to calculate the amount of information in one term of a continued fraction:

  • The entropy of the Gauss-Kuzmin distribution is about 3.4325 bits.

  • Twice the base-2 logarithm of the Khinchin-Lévy constant is about 3.4237 bits.

These differ by about 0.0088 bits. It took me a while to figure out why they were different at all, and now I’m surprised by how close they are.

Continued fractions

π is about 3.

Actually, 3 is a bit too small, by about $1/7$.

Actually, 7 is a bit too small, by about $1/15$.

Actually, 15 is too small by quite a lot: by almost $1$.

But the 1 in the denominator is a tiny bit too small, by about $1/292$.

We can keep doing this forever, and get an expression like:

$$\pi = 3 + \cfrac{1}{7 + \cfrac{1}{15 + \cfrac{1}{1 + \cfrac{1}{292 + \dots}}}}$$

This is pretty useful for things like finding rational approximations. We can drop that $1/(292+\dots)$, since it’s really small, and get the excellent approximation $\frac{355}{113}$. Or we can drop the $1/(15+\dots)$ and get the rougher approximation $\frac{22}{7}$.
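
For instance, here’s a quick sketch (my own, not anything from the post; the `from_terms` helper is just a name I made up) that peels off π’s terms and rebuilds those two approximations:

```python
from fractions import Fraction
import math

# A quick sketch (not from the post): peel off the continued-fraction terms
# of pi and rebuild the truncations, to check the 22/7 and 355/113 claims.
# math.pi is only a double, so only the first dozen or so terms are reliable.
x = Fraction(math.pi)
terms = []
for _ in range(6):
    a = math.floor(x)
    terms.append(a)
    x = 1 / (x - a)
print(terms)  # [3, 7, 15, 1, 292, 1]

# Rebuild a fraction from a truncated list of terms, working back to front.
def from_terms(ts):
    value = Fraction(ts[-1])
    for a in reversed(ts[:-1]):
        value = a + 1 / value
    return value

print(from_terms([3, 7]))         # 22/7
print(from_terms([3, 7, 15, 1]))  # 355/113
```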

In general, the continued fraction for a real number $x$ is of the form

$$x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \dots}}}$$

where the terms $a_1, a_2, a_3, \dots$ are positive integers (except $a_0$, which is an arbitrary integer).

This is infinite for any irrational number; for a rational number, it stops after some number of terms, e.g.

$$\frac{355}{113} = 3 + \cfrac{1}{7 + \cfrac{1}{16}}.$$

Truncating a continued fraction[1] is the optimal way to approximate a real number with a rational: it gives a better approximation than anything with a smaller denominator.

Continued fractions show up every now and then in different ways, but the approximation property is usually how I end up bumping into them.

The Gauss-Kuzmin distribution

Suppose we choose a random real number (say, uniformly between 0 and 1). The probability that the $n$th term $a_n$ is equal to $k$ converges to $p(k) = -\log_2\!\left(1 - \frac{1}{(k+1)^2}\right)$ as $n \to \infty$.[2] This is the Gauss-Kuzmin distribution.

I think it’s also true that the fraction of $k$’s in the continued fraction converges to $p(k)$.[3]
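
For concreteness, the first few values work out to

$$p(1) = \log_2\tfrac{4}{3} \approx 0.415, \qquad p(2) = \log_2\tfrac{9}{8} \approx 0.170, \qquad p(3) = \log_2\tfrac{16}{15} \approx 0.093,$$

with a tail that falls off roughly like $1/k^2$.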

The entropy of this distribution is just $-\sum_{k=1}^{\infty} p(k)\log_2 p(k)$,[4] which is pretty quick to compute out to a few digits of accuracy in your favorite programming language. It works out to about $3.4325$ bits, so that’s how much information there is in the term $a_n$, for large $n$.
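
For example, here’s a rough sketch of that computation (mine, with an arbitrary cutoff for the infinite sum):

```python
import math

# Gauss-Kuzmin probability that a continued-fraction term equals k
def p(k):
    return -math.log2(1 - 1 / (k + 1) ** 2)

# Entropy in bits, truncating the infinite sum at an arbitrary large cutoff.
# The summand falls off roughly like log(k)/k^2, so 10**6 terms is enough
# for about four decimal places.
H = -sum(p(k) * math.log2(p(k)) for k in range(1, 10**6))
print(H)  # ≈ 3.4325
```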

Lochs’s theorem

Lochs’s theorem: For almost all real numbers, asymptotically, the accuracy of the continued fraction improves by a factor of $e^{\pi^2/(6\ln 2)} \approx 10.73$ per term.[5] This is amusingly close to 10, which is how much the accuracy of decimals improves per digit.

You get $\log_2 10 \approx 3.32$ bits of information from a digit of decimal, so this means that you get $\frac{\pi^2}{6(\ln 2)^2} \approx 3.4237$ bits[6] of information per term of a continued fraction, asymptotically.
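
As a sanity check, here’s a tiny sketch (again mine) evaluating those constants:

```python
import math

# Khinchin-Levy constant: the n-th root of the n-th denominator tends to this
levy = math.exp(math.pi**2 / (12 * math.log(2)))
print(levy)                 # ≈ 3.2758

# Accuracy goes like 1/denominator^2, so it improves by levy**2 per term
print(levy**2)              # ≈ 10.731, the factor "amusingly close to 10"

# ...which is 2*log2(levy) = pi^2 / (6 (ln 2)^2) bits per term
print(2 * math.log2(levy))  # ≈ 3.4237
```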

The difference

So what gives? Why aren’t those the same??

Hint:

It’s not about how you take the limits to get these two things, as I initially thought. Eventually, the error in both numbers above will drop so far that 0.0088 bits looks enormous by comparison, so we can just think about the behavior after a bajillion terms and not worry about the fact that both theorems are asymptotic.

The answer, unless I’m mistaken:

The Gauss-Kuzmin distribution is a marginal distribution, so its entropy is the entropy of the nth term if you don’t know anything about previous terms.

Twice the base-2 logarithm of the Khinchin-Lévy constant is the entropy of the conditional distribution; it measures the entropy of the nth term if you know every previous term.
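
In symbols, just restating the numbers from the top of the post (with $a_n$ the $n$th term):

$$H(a_n) \approx 3.4325 \text{ bits}, \qquad H(a_n \mid a_1, \dots, a_{n-1}) \approx 3.4237 \text{ bits},$$

a gap of about 0.0088 bits.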

These are different because—evidently—different terms of the continued fraction are correlated, even for a generic real number, even in the limit of large n!

I implicitly assumed that the correlation would tend to zero, and barely noticed I was making this assumption.

It seems that if you know the first bajillion terms of a continued fraction, you actually have a tiny amount of information about the bajillion-plus-first term; less than a hundredth of a bit, but not zero.

I haven’t tried to calculate the conditional distribution and confirm this yet, because it seems really hard to measure accurately. But I’m tempted, because I have no idea what this tiny correlation looks like. Is it just about consecutive terms, or longer-ranged than that? Is the size of consecutive terms correlated or anticorrelated? Is the information concentrated in one outcome—“the next term will probably not be 283434”—or spread out evenly across the whole distribution? Presumably there are results on this in the literature?
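
If I do try it, something like the following rough sketch is probably where I’d start (entirely my own; the sample sizes and the choice to look only at consecutive 1s are arbitrary). It estimates how often two consecutive terms are both equal to 1 and compares that with the square of the Gauss-Kuzmin probability, which is what independent terms would predict:

```python
import math
import random

def cf_terms(p, q, n):
    """First n continued-fraction terms of p/q, assuming 0 < p < q."""
    terms = []
    while p and len(terms) < n:
        a, r = divmod(q, p)
        terms.append(a)
        p, q = r, p
    return terms

# "Random reals" approximated by random 512-bit rationals; only the first
# 100 terms are kept, so the eventual rational tail shouldn't matter much.
BITS, N_TERMS, SAMPLES = 512, 100, 5000
both_one = pairs = 0
for _ in range(SAMPLES):
    num = random.randrange(1, 1 << BITS)
    terms = cf_terms(num, 1 << BITS, N_TERMS)
    for a, b in zip(terms, terms[1:]):
        pairs += 1
        both_one += (a == 1 and b == 1)

p1 = math.log2(4 / 3)    # Gauss-Kuzmin probability that a single term is 1
print(both_one / pairs)  # empirical frequency of two consecutive 1s
print(p1**2)             # what independent terms would predict, ≈ 0.172
```

Whether any gap between those two printed numbers is real or just noise would take more care (error bars, longer-range pairs, and so on), so treat this as a starting point rather than a measurement.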

But I’m still confused about why there’s only about a quarter-percent difference. Why should these numbers be so close, but still different?

If you have any intuition for this, please let me know.

  1. ^

    If I’m remembering this right, you truncate by replacing a term $a_n + \cfrac{1}{a_{n+1} + \dots}$ with either $a_n$ or $a_n + 1$. (In other words, by rounding the remaining tail down to zero or up to one.)

  2. ^

    The logarithm has to be base 2 to make the probabilities add up to 1. Isn’t that weird?
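
    One way to check: the sum telescopes,

    $$\sum_{k=1}^{\infty} -\log_2\!\left(1 - \frac{1}{(k+1)^2}\right) = -\log_2 \prod_{k=1}^{\infty} \frac{k(k+2)}{(k+1)^2} = -\log_2 \frac{1}{2} = 1.$$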

  3. ^

    Formally: with probability 1, $p(k)$ is the limit of the fraction of $k$’s in the first $N$ terms, as $N \to \infty$.

    This part of the post originally said this was equivalent to the previous statement, but it’s actually stronger.

  4. ^

    Yes, there’s gonna be a double logarithm in there.

  5. ^

    Just to throw another theorem on the pile: for the rational approximation you get by truncating after $n$ terms, the $n$th root of the denominator approaches $e^{\pi^2/(12 \ln 2)} \approx 3.2758$ as $n \to \infty$. This is the square root of the number in Lochs’s theorem and is called the Khinchin-Lévy constant, after Khinchin, who proved it was a constant, and Lévy, who found its value using the Gauss-Kuzmin distribution.

    Why the square root? I think because the accuracy of these approximations goes like the square of the denominator—see also Dirichlet’s approximation theorem.

    Lochs’s theorem came late; I assume it relies on Lévy’s work but I haven’t seen the proof.

  6. ^

    One factor of $\ln 2$ in the denominator is to make the unit “bits”, and the other is from the exponent of the Khinchin-Lévy constant (and related to the Gauss-Kuzmin distribution having $\log_2$ in it, I believe).