This is the second post in a series that explores a circle of ideas around entropy and information processing in complex systems. We hope it will form an accessible introduction to James Crutchfield’s Computational Mechanics framework.

In the previous post we saw that

Data is, in general, a mixture of randomly generated components and deterministic hidden structure
Kolmogorov Complexity and Shannon Entropy are powerful measures of structure. However, we also encountered some limitations: Kolmogorov complexity overestimates deterministic structure while the Shannon Entropy underestimates deterministic structure.
We need a framework that allows us to find and characterize both structured and random parts of our data.

In this post we will show that we can characterize both structure and randomness in a system by looking at how entropy (irreducible uncertainty) changes over increasing scales. This will lead to ways in which we can understand transformer function in language tasks.

Structure is found in how entropy scales

We want something conceptually in between the randomness of Shannon Entropy and the determinism of Kolmogorov complexity. The way to bridge the gap is by realizing that systems that have deterministic structure at large scales look random from the perspective of entropy at short scales, but look less and less random as we increase the scale. In other words, if we can’t see far or remember a lot, then things seem random, and as we see farther and remember more, structure is revealed. And what’s more is that the way randomness turns into structure at longer scales tells us many computational properties of the a system. Let’s see how this works.

Imagine we come across some mystery data:

...00110100100100110110110100110100100100110110110100100110110100...

Let’s assume that we don’t know how it was generated.

Notice that the frequency of 0 and 1′s is equal in our string.

So maybe it’s just a random string generated by a fair coin? After all, if we compare to the distribution of 0s and 1s of a string generated by a fair coin we get the same results.

That’s a good initial guess but let’s increase our scale and look at the length 2 blocks as well. By length 2 blocks we mean here the distribution over 00, 01, 10, and 11 in our data.

Notice that the 01 block occurs more often than the 00 block, so the statistics of our data are inconsistent with that of a fair coin at this larger scale; our data looks less random at this longer scale than it did at the length 1 block scale. Maybe it’s some kind of biased coin? Let’s look at how our mystery data compares to that of a ⁷⁵⁄₂₅ biased coin for block lengths 2.

Notice that in the length two blocks the biased coin has an asymmetry—the 00 and 11 blocks are not equal in frequency . This will be true of a string generated from coin flips with any bias, but is not true of our mystery string. Looking at the distribution over block length 3:

In our mystery string, the blocks 111 and 000 never occur! That’s impossible for any data generated by coin flips, no matter how biased.

If the data is completely random, then at any block length, all possible blocks will occur. If the data is a completely structured sequence of a certain period, then some blocks will never occur! Technically speaking, when considering the distribution of length $k$ blocks, random data will have nonzero frequency over all $2^{k}$ possible blocks, whereas for any data string that has a period of $N$ , if we look at the distribution of tokens of length $k \geq N$ there will be exactly $N$ tokens of length $k$ that occur with nonzero frequency.

The main point here is that deterministic sequences are associated with a constant distribution as you scale up block length, and random sequences are associated with $2^{N}$ occurring blocks.

Recall from the previous post how the mystery data was generated: you output 0, then 1, then you flip a fair coin, and then repeat. Our data was a mix of both a ‘random’ component and deterministic hidden structure. Recall that data is a mixture of ‘random’ components and deterministic hidden structure.^[3] Let’s look at the block distributions of the mystery data up to block length-9.

As block length increases the number of occurring blocks does not stay constant, nor does every possible block occur! It is a mix of both random components and deterministic structure.

There is also fractal-like behavior. If you focus on the columns of the histograms you will notice a pattern that repeats (at a different scale) every time you increase the block length by 3. This has something to do with renormalization and scaling laws and fractals, but for now let’s just leave it as a teaser.

Entropy Rates, Block Entropy Diagrams and Memory

Recall our claim that if we can’t see far or remember a lot, then things seem random, and as we see farther and remember more, structure is revealed. And what’s more is that the way randomness turns into structure at longer scales tells us many computational properties of the system. Here is the mystery string again.

...00110100100100110110110100110100100100110110110100100110110100...

Let’s look at how the entropy of these distributions scales as block token length increases.

Block Entropy Diagrams

Recall the basic intuition about entropy: entropy is a canonical measure of ‘negative information’. It measures the ‘irreducible uncertainty’. We want to look how this measure varies as we consider longer and longer block lengths (and therefore time-scales!)

Define the fundamental notion of block entropy:

$H (L) := - \sum_{s_{L} \in A_{L}} P (s_{L}) log P (s_{L})$

Here $A_{L}$ is the set of all length-L blocks $s_{L} = s_{L}^{i} = s_{i} . . . s_{i + L}$ and $P (s_{L})$ denotes the frequency of that block sequence in the data. The block entropy measures the irreducible uncertainy when predicting block sequences of length $L$ in the data set. For length $L = 1$ we recover the naive Shannon entropy defined above. We can plot the block entropy as block length increases.

The block entropy diagrams for different data: 01 and 0110111 are length-2 and length-6 sequences that are repeated deterministically .

This is called the block entropy diagram.

Again, we see that the data generated from coin flips are ever increasing straight lines, and the two repeated deterministic strings flatten out once you hit the sequence length.

Next define the block entropy rate at scale $L$ to be

$h_{μ} (L) := H (L) - H (L - 1)$

What does this expression measure? It is often a fruitful tactic to examine the asymptotics of mathematical functions to understand. So let us examine its large $L$ asymptotic limit.

Entropy rate

The entropy rate $h_{μ} = {lim}_{L \to \infty} h_{μ} (L)$ can be thought of as

Irreducible randomness: the randomness that persists even after accounting for all correlations over arbitrarily large blocks of tokens.
The randomness that cannot be “explained away”.
Expected uncertainty per symbol.

What does the entropy rate look like for real text data? Researchers have tried to estimate the entropy rate of English text. We hope you now have some inkling the subtlety of this question. For instance: what texts do we look at? In reality we cannot take the large $L$ limits so we have to pick some cutoff, et cetera. From a recent paper on the topic:

Information Theory’s father, C. E. Shannon showed that the entropy of printed English should be bounded between 0.6 and 1.3 bits/letter over 100-letter long sequences of English text. He used for his calculations a human prediction approach and ignored punctuation, lowercase, and uppercase. Cover and King, using gambling estimates, found a value between 1.25 and 1.35 bits per character. Teahan and Cleary reported a value for the entropy of English of 1.46 bits per character using data compression. Kontoyiannis reported a value of 1.77 bits per character using a method based on string matching, resembling that of universal source coding, based on a one million-character sample from a same author.

Block Entropy Rate and Intrinsic Memory

Let’s get back to the block entropy rate. Recall we defined the block entropy rate at scale $L$ to be

$h_{μ} (L) := H (L) - H (L - 1)$

We can think of $h_{μ} (L)$ as the average uncertainty of the next symbol, given that the previous $L - 1$ symbols have been observed. Said differently: if you had a memory of size $L - 1$ , your estimate of the entropy (inherent uncertainty) of the process would be $h_{μ} (L)$

The block entropy rate at length $L - 1$ can be thought of as the upper limit on how good your predictions can ideally be, with a memory of length $L - 1$ . Think of these as estimates of the true block entropy rate (taking $L \to \infty$ ), which is the actual irreducible entropy in the generating process.

^
Does the block entropy rate at scale $L$ equal the physical memory in the system in some sense? This is fraught. It is better to say that the block entropy rate measures the ‘apparent physical memory’. Some of the physical memory may be ‘encrypted’.

Entropy Scaling And Intrinsic Memory

Structure is found in how entropy scales

Entropy Rates, Block Entropy Diagrams and Memory

Block Entropy Diagrams

Entropy rate

Block Entropy Rate and Intrinsic Memory