A Correspondence Theorem in the Maximum Entropy Framework

Classical mechanics didn’t work any less well once we discovered quantum mechanics; Galilean relativity and Newtonian gravity didn’t work any less well once we discovered special and general relativity; etc. This is the correspondence principle, aka Egan’s Law: in general, to the extent that old models match reality, new models must reproduce the old.

This sounds like it should be a Theorem, not just a Law—a “correspondence theorem”.

This post presents such a theorem within one particular framework—maximum entropy. It’s not the theorem I’d eventually like—even my correspondence theorem from last month has more substance to it. But this one is simple, intuitive, and both its strong points and shortcomings help to illustrate what I want out of a correspondence theorem. (In fact, I wrote down this one before the other correspondence theorem, and the shortcomings of this theorem partially inspired that one.)

Background: Principle of Maximum Entropy

By “maximum entropy framework”, I mean roughly the technique used by Jaynes—see e.g. the widget problem in Logic of Science, page 440. If you’ve seen the principle of maximum entropy in the context of statistical mechanics, it’s the same thing. I expect a lot of people who are otherwise familiar with maximum entropy distributions are not familiar with this particular framework (especially outside of a physics context), so a bit of background is in order.

We’ll start with the widget problem: a company produces red, green and blue widgets, and for inventory purposes we wish to predict how many of each color will be sold on a given day. Based on sales aggregated over the past year, the average sales per day are roughly 40 green, 20 red, and 10 blue. Given only this information, what distribution should we assign to tomorrow’s sales?

In a high-school statistics class, the answer would be “insufficient information”—the averages are not enough info to figure out the whole distribution. But this isn’t a high-school statistics class. We’re Bayesians, we’ve received relevant information, we have to update somehow.

Jaynes argues that the canonically-correct way to do this is maximum entropy: we find the model which has the highest possible entropy, subject to the constraints $E[\#\text{green}] = 40$, $E[\#\text{red}] = 20$, and $E[\#\text{blue}] = 10$. A couple ways to interpret this idea:

  • The classical argument from stat mech: the vast majority of distributions consistent with the constraints are very close to the maximum entropy distribution. Equivalently, we implicitly have a (non-normalizable) uniform prior over distribution-space which we’re updating (via Bayes’ rule) based on the expectations.

  • The argument from information: maximizing entropy means minimizing the information assumed in the model. By maximizing entropy subject to the expectation constraint, we’re accounting for the expectation but assuming as little as possible aside from that (in an information-theoretic sense).

Regardless of how we interpret it, the math works out the same. If our variable is $X$ and our constraints are $E[f_j(X)] = \bar{f}_j$, then the maximum entropy distribution is

$$P[X] = \frac{1}{Z} e^{\sum_j \lambda_j (f_j(X) - \bar{f}_j)}$$

where

$$Z(\lambda) = \sum_X e^{\sum_j \lambda_j (f_j(X) - \bar{f}_j)}$$

and $\lambda$ is the minimizing argument of $Z(\lambda)$. You can find derivations and discussions and all that in any stat mech book, or in Jaynes. For the widget problem above, $X$ would be the colors of each widget ordered in a day, $f_1(X)$ would be the number of green widgets ordered, and the constraints would say $E[f_1(X)] = 40$, etc. To compute the $\lambda$’s, we’d evaluate the sum (or integral) analytically and then minimize (see Jaynes).
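To make the machinery concrete, here is a minimal numerical sketch (mine, not Jaynes’s) of the dual minimization. It uses a deliberately simplified version of the widget problem in which the state is just the number of green widgets sold in a day, capped at 100, with the single constraint $E[\#\text{green}] = 40$; the helper name `maxent_fit` and all other details are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def maxent_fit(features, targets):
    """Maximum-entropy distribution over a finite state space.

    features: (n_features, n_states) array; row j holds f_j evaluated at each state.
    targets:  (n_features,) array of target averages, i.e. constraints E[f_j(X)] = targets[j].
    Returns (P, Z): the fitted distribution and the normalizer at the optimal lambda.
    """
    centered = features - targets[:, None]
    log_Z = lambda lam: logsumexp(lam @ centered)        # log Z(lambda), convex in lambda
    lam = minimize(log_Z, x0=np.zeros(len(targets))).x   # lambda = argmin Z(lambda)
    P = np.exp(lam @ centered - log_Z(lam))
    return P, np.exp(log_Z(lam))

# Simplified widget problem: state = number of green widgets sold in a day.
xs = np.arange(101)
P, Z = maxent_fit(np.stack([xs]), np.array([40.0]))
print("E[#green] under the max-ent model:", (P * xs).sum())   # ~40
```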

Anyway, the main point here is that we can now “update” on certain kinds of information by adding constraints to our maximum entropy problem. For instance, if we find out that the daily variance in green widget sales is 10, then we’d add a constraint saying $E[(f_1(X) - 40)^2] = 10$. Our maximum entropy distribution would then have one additional $\lambda$ and an additional term in the exponent. All written out, we’d go from

$$P[X] = \frac{1}{Z} e^{\lambda_1 (f_1(X) - 40) + \lambda_2 (f_2(X) - 20) + \lambda_3 (f_3(X) - 10)}$$

to

$$P[X] = \frac{1}{Z} e^{\lambda_1 (f_1(X) - 40) + \lambda_2 (f_2(X) - 20) + \lambda_3 (f_3(X) - 10) + \lambda_4 ((f_1(X) - 40)^2 - 10)}$$

… and we’d have to solve the modified minimization problem to find the new $\lambda$’s and $Z$.
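Continuing the toy sketch from above (same caveats, same illustrative `maxent_fit` helper): the variance “update” is literally just one more feature row and one more $\lambda$ in the same minimization.

```python
# Continuing the earlier sketch: add E[(#green - 40)^2] = 10 as a second feature
# alongside E[#green] = 40, and re-solve the same minimization problem.
feats = np.stack([xs, (xs - 40.0) ** 2])
targets = np.array([40.0, 10.0])
P_v, Z_v = maxent_fit(feats, targets)
print("mean:", (P_v * xs).sum())                              # ~40
print("E[(#green - 40)^2]:", (P_v * (xs - 40.0) ** 2).sum())  # ~10
```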

Conceptual takeaway: rather than updating on individual data points, in this framework we’re given a sequence of summaries of “features” of the dataset, of the form “the expectation of $f_j$ is $\bar{f}_j$” (found by e.g. computing the average of $f_j$ over a large data set). Each such feature becomes a new constraint in an optimization problem. This turns out to be equivalent to a Bayesian update in situations where a Bayesian update makes sense, but is more general—roughly speaking, it can work directly with virtual evidence updates.

Correspondence Theorem for Maximum Entropy Updates

On to the interesting part.

Let’s imagine that two analysts are (separately) building maximum-entropy models of some data. They each query the giant database for average values of certain features: $\bar{f}^1$ for the first analyst’s features $f^1$, $\bar{f}^2$ for the second analyst’s features $f^2$. They end up with two models:

  • $M_1$ is the maximum-entropy model with features $E[f^1(X)] = \bar{f}^1$

  • $M_2$ is the maximum-entropy model with features $E[f^2(X)] = \bar{f}^2$

We’ll assume that both of these are “correct”, in the sense that the average values $\bar{f}^1$ and $\bar{f}^2$ actually do match the data.

Let’s say that model 2 is “better” than model 1, in the sense that it has better predictive power on the real data-generating process $P^*$: $E_{X \sim P^*}[\ln P[X|M_2]] > E_{X \sim P^*}[\ln P[X|M_1]]$. (Up to sign and a constant factor, this is the average number of bits used by each model to encode a data point from $P^*$.) So, the analysts’ boss plans to just use model 2. But let’s stretch the story to AI-alignment-style concerns: what if model 2 is using some weird ontology? What if the things the company cares about are easy to express in terms of the features $f^1$, but hard to express in terms of the features $f^2$?

Now for the claim.

We have two possibilities. Either:

  • We can construct a third model $M'$ which has strictly better predictive power than $M_2$, OR

  • The features $\bar{f}^1$ are already implied by $M_2$; those features are already “in there” in some sense.

The proof will show the sense in which this is true.

Proof

The obvious thing to do is combine the two models into a single maximum-entropy model $M'$ with both the features $f^1$ and $f^2$. How does the predictive power of this model look?

For maximum-entropy models in general, the predictive power has a simple expression, assuming the $\bar{f}$ values are correct for $P^*$ (i.e. $E_{P^*}[f(X)] = \bar{f}$):

$$E_{P^*}[\ln P[X|M]] = E_{P^*}\left[\sum_j \lambda_j (f_j(X) - \bar{f}_j) - \ln Z\right] = -\ln Z$$

… so it’s just the negative log of the normalizer $Z$. So, $M'$ has higher predictive power than $M_2$ if-and-only-if $Z' < Z_2$.
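As a quick numerical sanity check on this identity, here is a sketch (again reusing the illustrative `maxent_fit` helper from the first code block): pick an arbitrary “true” distribution $P^*$, hand the model the exact feature averages under $P^*$, and compare the model’s predictive power against $-\ln Z$.

```python
# Reusing maxent_fit from the first sketch.
states = np.arange(6)
rng = np.random.default_rng(0)
p_star = rng.dirichlet(np.ones(6))                   # a made-up "true" distribution P*

f = np.stack([states, states ** 2]).astype(float)    # features: X and X^2
f_bar = f @ p_star                                   # their exact averages under P*

P, Z = maxent_fit(f, f_bar)
print("E_{P*}[ln P[X|M]]:", p_star @ np.log(P))
print("-ln Z:            ", -np.log(Z))              # should match
```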

Now, recall that $Z$ comes from a minimization problem. Specifically:

$$Z_2 = \min_{\lambda^2} \sum_X e^{\lambda^2 \cdot (f^2(X) - \bar{f}^2)}$$

$$Z' = \min_{\lambda^1, \lambda^2} \sum_X e^{\lambda^1 \cdot (f^1(X) - \bar{f}^1) + \lambda^2 \cdot (f^2(X) - \bar{f}^2)}$$

Key thing to notice: the objective for $Z_2$ is just the objective for $Z'$ with $\lambda^1$ set to zero. In other words: the space which model $M_2$ searches for minima is a subset of the space which model $M'$ searches for minima. Thus, $Z'$ is always at least as small as $Z_2$; model $M'$ has predictive power at least as high as $M_2$.
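Here is the same point numerically (illustrative, reusing `maxent_fit` and the toy setup style from the previous sketch): adding analyst 1’s features to analyst 2’s model can only shrink the minimized normalizer, and the combined model reproduces analyst 1’s averages by construction.

```python
# Reusing maxent_fit from the first sketch.
states = np.arange(6)
rng = np.random.default_rng(1)
p_star = rng.dirichlet(np.ones(6))               # true data-generating distribution P*

f1 = np.stack([(states % 2).astype(float)])      # analyst 1's feature: parity of X
f2 = np.stack([states.astype(float)])            # analyst 2's feature: X itself
f1_bar, f2_bar = f1 @ p_star, f2 @ p_star        # both sets of averages are "correct"

_, Z_2 = maxent_fit(f2, f2_bar)
P_c, Z_c = maxent_fit(np.vstack([f1, f2]), np.concatenate([f1_bar, f2_bar]))
print("Z' <= Z_2:", Z_c <= Z_2 + 1e-9)           # M' predicts at least as well as M_2
print("E_{M'}[f^1] vs f^1-bar:", f1 @ P_c, f1_bar)
```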

Furthermore, let’s assume that the optima in these problems are unique—that’s not necessarily the case, but it is usually true in practice. (The objectives are convex, so uniqueness of the minimum can only fail in specific ways—I’ll leave the “fun” of working that out as an exercise to the reader.) We know that the objective for $Z'$ reduces to the objective for $Z_2$ when $\lambda^1 = 0$; if the optima are unique and $Z' = Z_2$, then that means the optimal $\lambda^1$ is indeed 0, so $M' = M_2$.

… but $M'$ has to satisfy the constraints $E_{M'}[f^1(X)] = \bar{f}^1$. So if $M' = M_2$, then $M_2$ also satisfies those constraints. That’s the sense in which the features $\bar{f}^1$ are “already present” in $M_2$: $E_{M_2}[f^1(X)] = \bar{f}^1$.

So, we have two cases:

  • $Z' < Z_2$: $M'$ has strictly better predictive power than $M_2$ (i.e. $E_{P^*}[\ln P[X|M']] > E_{P^*}[\ln P[X|M_2]]$)

  • $Z' = Z_2$: the features from model 1 are already implicit in model 2 (i.e. $E_{M_2}[f^1(X)] = \bar{f}^1$)

What’s Really Going On Here?

If we strip away the math, the underlying phenomenon here seems kind of trivial.

The key is that we assume $\bar{f}$ is correct for the true data-generating process $P^*$. We justify this by imagining that we have some very large number of data points, so the law of large numbers kicks in and we can correctly estimate averages of our features. We’re not just collecting noisy data points; we’re directly learning facts about the true distribution, and we’re learning those facts with perfect certainty.

So our theorem is saying something like… two models both contain some facts about the (observable) true distribution. Either:

  • we can combine them into a strictly better model which contains all the facts from both, OR

  • all the facts from one model are already contained in the other, and the combined model makes the same predictions as the “better” original model.

(A warning, however: this intuitive story is not perfect. Even if the combined model makes the same predictions as one of the original models about the data, it can still update differently as we learn new facts.)

Is This Trivial?

Let’s go back to the starting point: it all adds up to normality. New models need to reproduce the old models in all the places where the old models worked—otherwise the new models are strictly suboptimal.

The obvious-but-trivial formalization of this is that new models have to make the same predictions about the data as the old models, in all the places where the old models predicted correctly. Corollary: any features (i.e. functions) of the data correctly predicted by the old models must also be correctly predicted by the new models.

… and that’s basically what we’ve proven. Within the maximum entropy framework, any features of the data (specifically long-run average values) correctly predicted by an “old” model must also be correctly predicted by a “new” model, else the new model is strictly suboptimal. So in that sense, it seems pretty trivial.

However, there are two senses in which it’s nontrivial. First, in the case where the new model incorrectly predicts some feature-values encoded in the old model, we’ve explicitly constructed a new model which outperforms the old. It’s even a pretty simple, clean model—just another maximum entropy model.

Second, even the “trivial” idea that new models must make the same predictions about the data in places where the old model was right can cover some pretty nontrivial cases, because “features” of the data distribution can be pretty nontrivial. For instance, we can have a whole class of features of the form $f_{x_1 x_2}(X) = \mathbb{1}[X_1 = x_1]\mathbb{1}[X_2 = x_2]$, with target values $\bar{f}_{x_1 x_2} = P^*[X_1 = x_1] P^*[X_2 = x_2]$. With infinitely many such constraints (one for each value pair), we can encode independence of $X_1$ and $X_2$. After all, independence of two observed variables is a property of the data distribution, so it’s something we can use as a “feature”. Likewise for conditional independence. (Of course, once we get into infinite-feature territory we do need to be more careful about applying the theory to real, finite data sets...)
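To illustrate in code (with the same caveat about infinite vs. finite feature families, and again reusing the illustrative `maxent_fit` helper): for two small discrete variables, one indicator-product feature per value pair, with target values equal to products of marginals, forces the fitted max-ent joint to factorize.

```python
# Reusing maxent_fit from the first sketch.
import itertools

states = list(itertools.product(range(3), range(2)))   # X = (X1, X2)
p1 = np.array([0.5, 0.3, 0.2])                          # made-up marginal of X1
p2 = np.array([0.6, 0.4])                               # made-up marginal of X2

# One feature per value pair: f_{ij}(X) = 1[X1 = i] * 1[X2 = j],
# with target value P[X1 = i] * P[X2 = j].
feats = np.array([[float(x1 == i and x2 == j) for (x1, x2) in states]
                  for i in range(3) for j in range(2)])
targets = np.array([p1[i] * p2[j] for i in range(3) for j in range(2)])

P_joint, _ = maxent_fit(feats, targets)
print(P_joint.reshape(3, 2))        # ~ the product distribution p1[i] * p2[j]
print(np.outer(p1, p2))             # for comparison
```

Of course, in this tiny example the constraints pin down the whole joint, so there is nothing left for entropy maximization to do; the point is just that “$X_1$ and $X_2$ are independent” is expressible purely as expectation constraints.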