A hundredth of a bit of extra entropy

There are two ways to calculate the amount of information in one term of a continued fraction:

  • The entropy of the Gauss-Kuzmin distribution is about 3.4325 bits.

  • Twice the base-2 logarithm of the Khinchin-Lévy constant is about 3.4237 bits.

These differ by about 0.0088 bits. It took me a while to figure out why they were different at all, and now I’m surprised by how close they are.

Continued fractions

π is about 3.

Actually, 3 is a bit too small, by about $1/7$.

Actually, 7 is a bit too small, by about $1/15$.

Actually, 15 is too small by quite a lot: by almost $1$.

But the 1 in the denominator is a tiny bit too small, by about $1/292$.

We can keep doing this forever, and get an expression like:

$$\pi = 3 + \cfrac{1}{7 + \cfrac{1}{15 + \cfrac{1}{1 + \cfrac{1}{292 + \dots}}}}$$

This is pretty useful for things like finding rational approximations. We can drop that $1/(292+\dots)$, since it’s really small, and get the excellent approximation $\frac{355}{113}$. Or we can drop the $1/(15+\dots)$ and get the rougher approximation $\frac{22}{7}$.
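
For instance, here’s a quick sketch (my own, not anything from the post; the `from_terms` helper is just a name I made up) that peels off π’s terms and rebuilds those two approximations:

```python
from fractions import Fraction
import math

# A quick sketch (not from the post): peel off the continued-fraction terms
# of pi and rebuild the truncations, to check the 22/7 and 355/113 claims.
# math.pi is only a double, so only the first dozen or so terms are reliable.
x = Fraction(math.pi)
terms = []
for _ in range(6):
    a = math.floor(x)
    terms.append(a)
    x = 1 / (x - a)
print(terms)  # [3, 7, 15, 1, 292, 1]

# Rebuild a fraction from a truncated list of terms, working back to front.
def from_terms(ts):
    value = Fraction(ts[-1])
    for a in reversed(ts[:-1]):
        value = a + 1 / value
    return value

print(from_terms([3, 7]))         # 22/7
print(from_terms([3, 7, 15, 1]))  # 355/113
```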

In general, the continued fraction for a real number $x$ is of the form

$$x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \dots}}}$$

where the terms $a_1, a_2, a_3, \dots$ are positive integers (except $a_0$, which is an arbitrary integer).

This is infinite for any irrational number; for a rational number, it stops after some number of terms, e.g.

$$\frac{355}{113} = 3 + \cfrac{1}{7 + \cfrac{1}{16}}.$$

Truncating a continued fraction[1] is the optimal way to approximate a real number with a rational: it gives a better approximation than anything with a smaller denominator.

Continued fractions show up every now and then in different ways, but the approximation property is usually how I end up bumping into them.

The Gauss-Kuzmin distribution

Suppose we choose a random real number (say, uniformly between 0 and 1). The probability that the $n$th term $a_n$ is equal to $k$ converges to $p(k) = -\log_2\!\left(1 - \frac{1}{(k+1)^2}\right)$ as $n \to \infty$.[2] This is the Gauss-Kuzmin distribution.

I think it’s also true that the fraction of $k$’s in the continued fraction converges to $p(k)$.[3]
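
For concreteness, the first few values work out to

$$p(1) = \log_2\tfrac{4}{3} \approx 0.415, \qquad p(2) = \log_2\tfrac{9}{8} \approx 0.170, \qquad p(3) = \log_2\tfrac{16}{15} \approx 0.093,$$

with a tail that falls off roughly like $1/k^2$.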

The entropy of this distribution is just $-\sum_{k=1}^{\infty} p(k)\log_2 p(k)$,[4] which is pretty quick to compute out to a few digits of accuracy in your favorite programming language. It works out to about $3.4325$ bits, so that’s how much information there is in the term $a_n$, for large $n$.
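
For example, here’s a rough sketch of that computation (mine, with an arbitrary cutoff for the infinite sum):

```python
import math

# Gauss-Kuzmin probability that a continued-fraction term equals k
def p(k):
    return -math.log2(1 - 1 / (k + 1) ** 2)

# Entropy in bits, truncating the infinite sum at an arbitrary large cutoff.
# The summand falls off roughly like log(k)/k^2, so 10**6 terms is enough
# for about four decimal places.
H = -sum(p(k) * math.log2(p(k)) for k in range(1, 10**6))
print(H)  # ≈ 3.4325
```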

Lochs’s theorem

Lochs’s theorem: For almost all real numbers, asymptotically, the accuracy of the continued fraction improves by a factor of $e^{\pi^2/(6\ln 2)} \approx 10.73$ per term.[5] This is amusingly close to 10, which is how much the accuracy of decimals improves per digit.

You get $\log_2 10 \approx 3.32$ bits of information from a digit of decimal, so this means that you get $\frac{\pi^2}{6(\ln 2)^2} \approx 3.4237$ bits[6] of information per term of a continued fraction, asymptotically.
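
As a sanity check, here’s a tiny sketch (again mine) evaluating those constants:

```python
import math

# Khinchin-Levy constant: the n-th root of the n-th denominator tends to this
levy = math.exp(math.pi**2 / (12 * math.log(2)))
print(levy)                 # ≈ 3.2758

# Accuracy goes like 1/denominator^2, so it improves by levy**2 per term
print(levy**2)              # ≈ 10.731, the factor "amusingly close to 10"

# ...which is 2*log2(levy) = pi^2 / (6 (ln 2)^2) bits per term
print(2 * math.log2(levy))  # ≈ 3.4237
```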

The difference

So what gives? Why aren’t those the same??

Hint:

It’s not about how you take the limits to get these two things, as I initially thought. Eventually, the error in both numbers above will drop so far that 0.0088 bits looks enormous by comparison, so we can just think about the behavior after a bajillion terms and not worry about the fact that both theorems are asymptotic.

The answer, unless I’m mistaken:

The Gauss-Kuzmin distribution is a marginal distribution, so its entropy is the entropy of the nth term if you don’t know anything about previous terms.

Twice the base-2 logarithm of the Khinchin-Lévy constant is the entropy of the conditional distribution; it measures the entropy of the nth term if you know every previous term.
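
In symbols, just restating the numbers from the top of the post (with $a_n$ the $n$th term):

$$H(a_n) \approx 3.4325 \text{ bits}, \qquad H(a_n \mid a_1, \dots, a_{n-1}) \approx 3.4237 \text{ bits},$$

a gap of about 0.0088 bits.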

These are different because—evidently—different terms of the continued fraction are correlated, even for a generic real number, even in the limit of large n!

I implicitly assumed that the correlation would tend to zero, and barely noticed I was making this assumption.

It seems that if you know the first bajillion terms of a continued fraction, you actually have a tiny amount of information about the bajillion-plus-first term; less than a hundredth of a bit, but not zero.

I haven’t tried to calculate the conditional distribution and confirm this yet, because it seems really hard to measure accurately. But I’m tempted, because I have no idea what this tiny correlation looks like. Is it just about consecutive terms, or longer-ranged than that? Is the size of consecutive terms correlated or anticorrelated? Is the information concentrated in one outcome—“the next term will probably not be 283434”—or spread out evenly across the whole distribution? Presumably there are results on this in the literature?
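
If I do try it, something like the following rough sketch is probably where I’d start (entirely my own; the sample sizes and the choice to look only at consecutive 1s are arbitrary). It estimates how often two consecutive terms are both equal to 1 and compares that with the square of the Gauss-Kuzmin probability, which is what independent terms would predict:

```python
import math
import random

def cf_terms(p, q, n):
    """First n continued-fraction terms of p/q, assuming 0 < p < q."""
    terms = []
    while p and len(terms) < n:
        a, r = divmod(q, p)
        terms.append(a)
        p, q = r, p
    return terms

# "Random reals" approximated by random 512-bit rationals; only the first
# 100 terms are kept, so the eventual rational tail shouldn't matter much.
BITS, N_TERMS, SAMPLES = 512, 100, 5000
both_one = pairs = 0
for _ in range(SAMPLES):
    num = random.randrange(1, 1 << BITS)
    terms = cf_terms(num, 1 << BITS, N_TERMS)
    for a, b in zip(terms, terms[1:]):
        pairs += 1
        both_one += (a == 1 and b == 1)

p1 = math.log2(4 / 3)    # Gauss-Kuzmin probability that a single term is 1
print(both_one / pairs)  # empirical frequency of two consecutive 1s
print(p1**2)             # what independent terms would predict, ≈ 0.172
```

Whether any gap between those two printed numbers is real or just noise would take more care (error bars, longer-range pairs, and so on), so treat this as a starting point rather than a measurement.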

But I’m still confused about why there’s only about a quarter-percent difference. Why should these numbers be so close, but still different?

If you have any intuition for this, please let me know.

  1. ^

    If I’m remembering this right, you truncate by replacing a term $a_n + \cfrac{1}{a_{n+1} + \dots}$ with either $a_n$ or $a_n + 1$. (In other words, by rounding the remaining tail down to zero or up to one.)

  2. ^

    The logarithm has to be base 2 to make the probabilities add up to 1. Isn’t that weird?
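
    One way to check: the sum telescopes,

    $$\sum_{k=1}^{\infty} -\log_2\!\left(1 - \frac{1}{(k+1)^2}\right) = -\log_2 \prod_{k=1}^{\infty} \frac{k(k+2)}{(k+1)^2} = -\log_2 \frac{1}{2} = 1.$$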

  3. ^

    Formally: with probability 1, $p(k)$ is the limit of the fraction of $k$’s in the first $N$ terms, as $N \to \infty$.

    This part of the post originally said this was equivalent to the previous statement, but it’s actually stronger.

  4. ^

    Yes, there’s gonna be a double logarithm in there.

  5. ^

    Just to throw another theorem on the pile: for the rational approximation you get by truncating after $n$ terms, the $n$th root of the denominator approaches $e^{\pi^2/(12 \ln 2)} \approx 3.2758$ as $n \to \infty$. This is the square root of the number in Lochs’s theorem and is called the Khinchin-Lévy constant, after Khinchin, who proved it was a constant, and Lévy, who found its value using the Gauss-Kuzmin distribution.

    Why the square root? I think because the accuracy of these approximations goes like the square of the denominator—see also Dirichlet’s approximation theorem.

    Lochs’s theorem came late; I assume it relies on Lévy’s work but I haven’t seen the proof.

  6. ^

    One factor of $\ln 2$ in the denominator is to make the unit “bits”, and the other is from the exponent of the Khinchin-Lévy constant (and related to the Gauss-Kuzmin distribution having $\log_2$ in it, I believe).