Message Length

Someone is broadcasting a stream of bits. You don’t know why. A 500-bit-long sample looks like this:

01100110110101011011111100001001110000100011010001101011011010000001010000001010
10100111101000101111010100100101010010101010101000010100110101010011111111010101
01010101011111110101011010101101111101010110110101010100000001101111100000111010
11100000000000001111101010110101010101001010101101010101100111001100001100110101
11111111111111111100011001011010011010101010101100000010101011101101010010110011
11111010111101110100010101010111001111010001101101010101101011000101100000101010
10011001101010101111...

The thought occurs to you to do Science to it—to ponder if there’s some way you could better predict what bits are going to come next. At first you think you can’t—it’s just a bunch of random bits. You can’t predict it, because that’s what random means.

Or does it? True, if the sequence represented flips of a fair coin—every flip independently landing either 0 or 1 with exactly equal probability—then there would be no way you could predict what would come next: any continuation you could posit would be exactly as probable as any other.

But if the sequence represented flips of a biased coin—if, say, 1 came up 0.55 of the time instead of exactly 0.5—then it would be possible to predict better or worse. Your best bet for the next bit in isolation would always be 1, and you would more strongly anticipate sequences with slightly more 1s than 0s.

You count 265 1s in the sample of 500 bits. Given the hypothesis that the bits were generated by a fair coin, the number of 1s (or without loss of generality, 0s) would be given by the binomial distribution Binomial(500, 0.5), which has a mean of 250 and a standard deviation of √(500 · 0.5 · 0.5) ≈ 11.2, so your observation of 15 excess 1s is only about 1.3 standard deviations from the mean—well within the realm of plausibility of happening by chance, although you’re at least slightly suspicious that the coin behind these bits might not be quite fair.
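
(In case you don’t trust your mental arithmetic, the calculation is quick to check in code. The following fragment is a throwaway sketch of mine, not part of the analysis to come; the variable names are just for illustration.)

    // Sanity-checking the fair-coin arithmetic.
    let flips = 500.0_f64;
    let fair_probability = 0.5_f64;
    let observed_ones = 265.0_f64;
    let mean = flips * fair_probability; // 250
    let standard_deviation = (flips * fair_probability * (1.0 - fair_probability)).sqrt(); // ≈ 11.18
    let excess = observed_ones - mean; // 15
    println!("{} excess 1s ≈ {:.2} standard deviations", excess, excess / standard_deviation); // ≈ 1.34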

… that is, if it’s even a coin. You love talking in terms of shiny, if hypothetical, “coins” rather than stodgy old “independent and identically distributed binary-valued random variables”, but looking at the sample again, you begin to further doubt whether the bits are independent of each other. You’ve heard that humans are biased to overestimate the frequency of alternations (101010...) and underestimate the frequency of consecutive runs (00000… or 11111…) in “truly” (uniformly) random data, but the 500-bit sample contains a run of 13 0s (starting at position 243) and a run of 19 1s (starting at position 319). You’re not immediately sure how to calculate the probability of that, but your gut says that should be very unlikely given the biased-coin model, even after taking into account that human guts aren’t very good at estimating these things.

Maybe not everything in the universe is a coin. What if the bits were being generated by a Markov chain—if the probability of the next bit depended on the value of the one just before? If a 0 made the next bit more likely to be a 0, and the same for 1, that would make the 00000… and 11111… runs less improbable.

Except … the sample also has a run of 17 alternations (starting at position 153). On the “fair coin” model, shouldn’t that itself be 2⁴ = 16 times as suspicious as the run of 13 0s and 1/4 as suspicious as the run of 19 1s which led you to hypothesize a Markov chain?—or rather, 8 and 1/8 times as suspicious, respectively, because there are two ways for an alternation to occur (0101010… or 1010101…).

A Markov chain in which a 0 or 1 makes another of the same more likely, makes alternations less likely: the Markov chain hypothesis can only make the consecutive runs look less surprising at the expense of making the run of alternations look more surprising.

So maybe it’s all just a coincidence: the broadcast is random—whatever that means—and you’re just apophenically pattern-matching on noise. Unless …

Could it be that some things in the universe are neither coins nor Markov chains? You don’t know who is broadcasting these bits or why; you called it “random” because you didn’t see any obvious pattern, but now that you think about it, it would be pretty weird for someone to just be broadcasting random bits. Probably the broadcast is something like a movie or a stock ticker; if a close-up sample of the individual bits looks “random”, that’s only because you don’t know the codec.

Trying to guess a video codec is obviously impossible. Does that kill all hope of being able to better predict future bits? Maybe not. Even if you don’t know what the broadcast is really for, there might be some nontrivial local structure to it, where bits are statistically related to the bits nearby, like how a dumb encoding of a video might have consecutive runs of the same bit-pattern where a large portion of a frame is the same color, like the sky.

Local structure, where bits are statistically related to the bits nearby … kind of like a Markov chain, except in a Markov chain the probability of the next state only depends on the one immediately before, which is a pretty narrow notion of “nearby.” To broaden that, you could imagine the bits are being generated by a higher-order Markov chain, where the probability of the next bit depends on the previous n bits for some specific value of n.

And that’s how you can explain mysteriously frequent consecutive runs and alternations. If the last two bits being 01 (respectively 10) makes it more likely for the next bit to be 0 (respectively 1), and the last two bits being 00 (respectively 11) makes it more likely for the next bit to be 0 (respectively 1), then you would be more likely to see both long 0000… or 1111… consecutive runs and 01010… alternations.
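
(To make that concrete, here’s a sketch of such a second-order chain as a generator. The function names and parameter values are made up for illustration, and it assumes the rand crate; it’s not a claim about what’s actually behind the broadcast.)

// Made-up second-order Markov chain: the probability that the next bit is 0,
// given the previous two bits (parameter values chosen purely for illustration).
fn probability_next_is_zero(previous_two: (u8, u8)) -> f64 {
    match previous_two {
        (0, 0) => 0.8, // a run of 0s tends to continue
        (1, 1) => 0.2, // a run of 1s tends to continue
        (0, 1) => 0.8, // ...01 tends to be followed by 0: the alternation continues
        (1, 0) => 0.2, // ...10 tends to be followed by 1: the alternation continues
        _ => unreachable!("only 0s and 1s are generated"),
    }
}

fn generate(length: usize) -> Vec<u8> {
    use rand::Rng; // assumes the `rand` crate is available
    let mut rng = rand::thread_rng();
    let mut bits = vec![0, 1]; // arbitrary first two bits
    while bits.len() < length {
        let previous_two = (bits[bits.len() - 2], bits[bits.len() - 1]);
        let next = if rng.gen::<f64>() < probability_next_is_zero(previous_two) {
            0
        } else {
            1
        };
        bits.push(next);
    }
    bits
}

A chain like this favors continuing whatever local pattern it’s in, so its output shows both long same-bit runs and long alternations, which the coin and the first-order chain couldn’t favor at the same time.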

A biased coin is just an n-th-order Markov chain where n = 0. An n-th-order Markov chain where n > 1 is just a first-order Markov chain where each “state” is a tuple of bits, rather than a single bit.

Everything in the universe is a Markov chain!—with respect to the models you’ve considered so far.

“The bits are being generated by a Markov chain of some order” is a theory, but a pretty broad one. To make it concrete enough to test, you need to posit some specific order n, and, given n, specific parameters for the next-bit-given-previous-n probabilities.

The n = 0 coin has one parameter: the bias of the coin, the probability of the next bit being 0. (Or without loss of generality 1; we just need one parameter p to specify the probability of one of the two possibilities, and then the probability of the other will be 1 − p.)

The n = 1 ordinary Markov chain has two parameters: the probability of the next bit being (without loss of generality) 0 given that the last bit was a 0, and the probability of the next bit being (without loss of …) 0 given that the last bit was a 1.

The n = 2 second-order Markov chain has four parameters: the probability of the next bit being (without loss …) 0 given that the last two bits were 00, the probability of the next bit being (without …) 0 given that the last two bits were 01, the probability of the next bit being—

Enough! You get it! The n-th order Markov chain has 2ⁿ parameters!

Okay, but then how do you guess the parameters?

For the n = 0 coin, your best guess at the frequency-of-0 parameter is going to just be the frequency of 0s you’ve observed. Your best guess could easily be wrong, and probably is: just because you observed 235/500 = 0.47 0s, doesn’t mean the parameter is 0.47: it’s probably somewhat lower or higher, and your sample got more or fewer 0s than average just by chance. But positing that the observed frequency is the actual parameter is the maximum likelihood estimate—the single value that most makes the data “look normal”.
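
(If you want to see “the single value that most makes the data look normal” in action, here’s a throwaway spot check of mine, with the candidate parameter values chosen arbitrarily: the log-likelihood of 235 0s and 265 1s, dropping the binomial coefficient that doesn’t depend on the parameter, really does peak at 0.47.)

    // Spot check: compare the log-likelihood of 235 0s and 265 1s at a few
    // candidate values of the frequency-of-0 parameter. (The binomial
    // coefficient is omitted because it doesn't depend on p.)
    for p in [0.40_f64, 0.45, 0.47, 0.50, 0.55] {
        let log_likelihood = 235.0 * p.ln() + 265.0 * (1.0 - p).ln();
        println!("p = {:.2}: log-likelihood = {:.2}", p, log_likelihood);
    }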

For n ≥ 1, it’s the same idea: your best guess for the frequency-of-0-after-0 parameter is just the frequency of 0 being the next bit, among all the places where 0 was the last bit, and so on.

You can write a program that takes the data and a degree n, and computes the maximum-likelihood estimate for the n-th order Markov chain that might have produced that data. Just slide an (n+1)-bit “window” over the data, and keep a tally of the frequencies of the “plus-one” last bit, for each of the possible n-bit patterns.

In the Rust programming language, that looks like the following. (The representation of our final theory is output as a map (HashMap) from (n+1)-bit patterns to frequencies/parameter values, each stored as a thirty-two-bit floating-point number, f32.)

use std::collections::HashMap;
// (The `Bit` type, its `ZERO`/`ONE` values, and the `bit_product` helper are
// defined in the full source linked at the end.)
fn maximum_likelihood_estimate(data: &[Bit], degree: usize) -> HashMap<(Vec<Bit>, Bit), f32> {
    let mut theory = HashMap::with_capacity(2usize.pow(degree as u32));
    // Cartesian product—e.g., if degree 2, [00, 01, 10, 11]
    let patterns = bit_product(degree);
    for pattern in patterns {
        let mut zero_continuations = 0;
        let mut one_continuations = 0;
        for window in data.windows(degree + 1) {
            let (prefix, tail) = window.split_at(degree);
            let next = tail[0];
            if prefix == pattern {
                match next {
                    ZERO => {
                        zero_continuations += 1;
                    }
                    ONE => {
                        one_continuations += 1;
                    }
                }
            }
        }
        let continuations = zero_continuations + one_continuations;
        theory.insert(
            (pattern.clone(), ZERO),
            zero_continuations as f32 / continuations as f32,
        );
        theory.insert(
            (pattern.clone(), ONE),
            one_continuations as f32 / continuations as f32,
        );
    }
    theory
}
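
(As a quick check that the tallying does what you intend, here’s a toy invocation of mine, not part of the broadcast analysis, using the ZERO and ONE values from the full source: five bits of data at degree 1.)

    // For the data 0 1 1 0 1 at degree 1: every 0 that has a successor is
    // followed by 1, while 1 is followed by 1 once and by 0 once.
    let data = vec![ZERO, ONE, ONE, ZERO, ONE];
    let theory = maximum_likelihood_estimate(&data, 1);
    assert_eq!(theory[&(vec![ZERO], ONE)], 1.0);
    assert_eq!(theory[&(vec![ZERO], ZERO)], 0.0);
    assert_eq!(theory[&(vec![ONE], ONE)], 0.5);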

Now that you have the best theory for each particular n, you can compare how well each of them predicts the data! For example, according to the n = 0 coin model with maximum-likelihood parameter p = 0.47, the probability of your 500-bit sample is about … 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007517883433770135.

Uh. The tiny probability makes sense: there’s a lot of randomness in 500 flips of a biased coin. Even if you know the bias, the probability of any particular 500-flip sequence is going to be tiny. But a number that tiny is kind of unwieldy to work with. You’d almost rather just count the zeros and ignore the specific digits afterwards.

But counting the zeros is just taking the logarithm—well, the negative logarithm in the case of zeros after the decimal point. Better make the log base-two—it’s thematic. Call this measurement the log loss.

fn log_loss(theory: &HashMap<(Vec<Bit>, Bit), f32>, data: &[Bit]) -> f32 {
    let mut total = 0.;
    // (`log2` here is an integer-logarithm helper from the full source; the
    // theory has two entries per prefix pattern, so this recovers the degree.)
    let degree = log2(theory.keys().count()) - 1;
    for window in data.windows(degree + 1) {
        let (prefix, tail) = window.split_at(degree);
        let next = tail[0];
        total += -theory
            .get(&(prefix.to_vec(), next))
            .expect("theory should have param value for prefix-and-continuation")
            .log2();
    }
    total
}

Now you can compare different theories to see which order of Markov chain is the best theory to “fit” your 500-bit sample … right?

    for hypothesized_degree in 0..15 {
        let theory = maximum_likelihood_estimate(&data, hypothesized_degree);
        println!(
            "{}th-order theory: fit = {}",
            hypothesized_degree,
            log_loss(&theory, &data)
        );
    }
0th-order theory: fit = 498.69882
1th-order theory: fit = 483.86075
2th-order theory: fit = 459.01752
3th-order theory: fit = 438.90198
4th-order theory: fit = 435.9401
5th-order theory: fit = 425.77222
6th-order theory: fit = 404.2693
7th-order theory: fit = 344.68494
8th-order theory: fit = 270.51175
9th-order theory: fit = 199.88765
10th-order theory: fit = 147.10117
11th-order theory: fit = 107.72962
12th-order theory: fit = 79.99724
13th-order theory: fit = 57.16126
14th-order theory: fit = 33.409912

There’s a problem. Higher choices of n monotonically achieve a better “fit”. You got the idea of higher-order Markov chains because the idea of a biased coin didn’t seem adequate to explain the consecutive and alternating runs you saw, but you somehow have trouble believing that the bitstream was generated by a fifteenth-order Markov chain with a completely separate probability for the next bit for each of the 2¹⁵ = 32,768 prefixes 000000000000000, 000000000000001, 000000000000010, &c. Having had the “higher-order Markov chain” idea, are you now obligated to set n as large as possible? What would that even mean?

In retrospect, the problem should have been obvious from the start. Using your sample data to choose maximum-likelihood parameters, and then using the model with those parameters to “predict” the same data puts you in the position of the vaunted “sharpshooter” who paints a target around a clump of bullet holes after firing wildly at the broad side of a barn.

Higher values of n are like a … thinner paintbrush?—or a squigglier, more “gerrymandered” painting of a target. Higher-order Markov chains are strictly more expressive than lower-order ones: the zeroth-order coin is just a first-order Markov chain where the next-bit-after-0 and next-bit-after-1 parameters just happen to be the same; the first-order Markov chain is just a second-order chain where the next-bit-after-00 and next-bit-after-10 parameters happen to be the same, as well as the next-bit-after-01 and—enough! You get it!
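
(If you want the expressiveness claim spelled out in code, here’s a sketch of the embedding. The embed_as_second_order helper is hypothetical, something of mine rather than anything in the full source; it builds a second-order theory whose parameter for each two-bit prefix simply ignores the older bit.)

// Hypothetical illustration (not in the full source): any first-order theory
// can be re-expressed as a second-order theory whose parameter for each
// two-bit prefix depends only on the more recent bit.
fn embed_as_second_order(
    first_order: &HashMap<(Vec<Bit>, Bit), f32>,
) -> HashMap<(Vec<Bit>, Bit), f32> {
    let mut second_order = HashMap::new();
    for older in [ZERO, ONE] {
        for newer in [ZERO, ONE] {
            for next in [ZERO, ONE] {
                let parameter = first_order[&(vec![newer], next)];
                second_order.insert((vec![older, newer], next), parameter);
            }
        }
    }
    second_order
}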

The broadcast is ongoing; you’re not limited to the particular 500-bit sample you’ve been playing with. If the worry were just that the higher-order models will (somehow, you intuit) fail to predict future data, you could use different samples for estimating parameters and validating the resulting models, but you think you’re suffering from some more fundamental confusion—one that’s probably not limited to Markov chains in particular.

Your working concept of what it means for a theory to “fit” the data is for it to maximize the probability with which the theory predicts the data. This is an objective, quantitative measurement. (Okay, the log loss is taking the negative logarithm of that to avoid so many zeros after the decimal point, but minimizing the log loss and maximizing the probability are both expressing the same preference on theories.)

How do you know (and your gut says that you know) that the higher-order models will do badly on future data, if your objective criterion of model-goodness says they’re better? The log loss always “wants” you to choose ever-more-complex models. You asked: what would that even mean? But maybe it doesn’t have to be a rhetorical question: what would that even mean?

Well … in the limit, you could choose a theory that assigns Probability One to the observed data. The “too many zeros”/​”avoid working with really tiny numbers” justification for taking the negative log doesn’t really apply here, but for consistency with your earlier results, you dutifully note that the logarithm of 1 is 0 …

But maybe “too many zeros” isn’t the only good motivation for taking the logarithm? Intelligence is prediction is compression. The log loss of a model against the data can be interpreted as the expected number of bits you would need to describe the data, given the optimal code implied by your model.

In order to communicate a reduction in your uncertainty, you need to send a signal—something you can choose to vary in response to the reality of the data. A signal you can vary to take two possible states can distinguish between two sets among which you’ve divided the remaining possibilities; writing down a bit means halving your uncertainty.
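
(To make “halving” concrete with a toy version of the arithmetic, using numbers of my own choosing rather than the broadcast’s: an outcome with probability 1/8 is one of eight equally likely possibilities, so singling it out takes three successive halvings: 3 bits, the negative base-two logarithm of 1/8.)

    // A toy check (numbers chosen for illustration, not taken from the data):
    // an outcome with probability 1/8 takes three halvings, i.e. 3 bits, to
    // single out.
    let probability = 0.125_f32;
    let description_length_in_bits = -probability.log2();
    println!("{} bits", description_length_in_bits); // 3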

On this interpretation, what the logarithm of Probability One being zero means is that if your theory predicted the exact outcome with certainty, then once you stated the theory, you wouldn’t have to say anything more in order to describe the data—you would just know with zero further bits.

Once you stated the theory. A theory implies an optimally efficient coding by which further bits can whittle down the space of possibilities to the data that actually happened. More complicated or unlikely data requires more bits just to specify—to single out that one outcome amongst the vastness of equally remote alternatives. But the same thing goes for theories.

Given a particular precision to which parameters are specified, there are exponentially more Markov chains of higher degrees, which can continue to drive down the log loss—but not faster than their own probability decreases. You need exponentially more data just to learn the parameters of a higher-order model. If you don’t have that much data—enough to pin down the parameters that single out this particular higher-order Markov chain amongst the vastness of equally remote alternatives—then your maximum-likelihood best guess is not going to be very good on future data, for the same reason you can’t expect to correctly guess that a biased coin has a probability of landing Heads of exactly 0.23 if you’ve only seen it flipped twice.

If you do have enough data to learn a more complex model, but the data was actually generated by a simpler model, then the parameters of the complex model will approximately take the settings that produce the same behavior as the simpler model—like a second-order Markov chain for which the bit-following-01 parameter happens to take the same value as the bit-following-11 parameter. And if you’re deciding what theory to prefer based on both fit and complexity, the more complex model won’t be able to “pay” for its increased complexity with its own predictions.

Now that you know what’s going on, you can modify your code to penalize more complex models. Since the parameters in your implementation are 32-bit floats, you assign a complexity cost of 32 · 2ⁿ bits to n-th order Markov chains, and look at the sum of fit (log loss) and complexity. Trying out your code again on a larger sample of 10,000 bits from the broadcast—

    for hypothesized_degree in 0..10 {
        let theory = maximum_likelihood_estimate(&data, hypothesized_degree);
        let fit = log_loss(&theory, &data);
        let complexity = 2f32.powi(hypothesized_degree as i32) * 32.;
        println!(
            "{}th-order theory: fit = {}, complexity = {}, total = {}",
            hypothesized_degree, fit, complexity, fit + complexity
        );
    }
0th-order theory: fit = 9970.838, complexity = 32, total = 10002.838
1th-order theory: fit = 9677.269, complexity = 64, total = 9741.269
2th-order theory: fit = 9111.029, complexity = 128, total = 9239.029
3th-order theory: fit = 8646.953, complexity = 256, total = 8902.953
4th-order theory: fit = 8638.786, complexity = 512, total = 9150.786
5th-order theory: fit = 8627.224, complexity = 1024, total = 9651.224
6th-order theory: fit = 8610.54, complexity = 2048, total = 10658.54
7th-order theory: fit = 8562.568, complexity = 4096, total = 12658.568
8th-order theory: fit = 8470.953, complexity = 8192, total = 16662.953
9th-order theory: fit = 8262.546, complexity = 16384, total = 24646.547

—reveals a clear preference for the third-order theory (that for which the fit-plus-complexity score is the lowest), allowing you to enjoy the huge 450-plus–bit leap in compression/​prediction from n := 2 to 3 and logically stop there, the steepness of the ascent into the madness of arbitrary complexity successfully dissuading you from chasing after diminishing returns (which you suspect are only hallucinatory). That’s the power packed by parsimony—the sublime simplicity of Science.

(Full source code.)