Maxent and Abstractions: Current Best Arguments

This post is not-very-distilled and doesn’t contain much background; it’s intended for people who already have the context of at least these four posts. I’m putting it up mainly as a reference for people who might want to work directly on the math of natural abstractions, and as a technical reference post.

There’s various hints that, in most real-world cases, the distribution of low-level state given high-level natural abstractions should take the form of a maximum entropy distribution, in which:

  • The “features” are sums over local terms, and

  • The high-level variables are (isomorphic to) the Lagrange multipliers

More formally: we have a low-level causal model (aka Bayes net) . Given the high-level variables , the distribution of low-level variable values should look like

… i.e. the maximum-entropy distribution subject to constraints of the form . (Note: , , and are all vector-valued.)

This is the sort of form we see in statistical mechanics. It’s also the form which the generalized Koopman-Pitman-Darmois (gKPD) theorem seems to hint at.

I don’t yet have a fully-satisfying general argument that this is the main form which abstractions should take, but I have two partial arguments. This post will go over both of them.

Maxent Telephone Argument

Two different nested layers of Markov blankets on the same underlying causal DAG

Quick recap of the Telephone Theorem: information about some variable passes through a nested sequence of Markov blankets . Information about can only be lost as it propagates. In the limit, all information is either perfectly conserved or completely lost. Mathematically, in the limit for some such that with probability approaching 1 as ; is the perfectly-conserved-in-the-limit information carrier.

In this setup, we can also argue that the limiting distribution should have a maxent form. (Note: this is a hand-wavy argument, not a proper proof.)

Think about how the distribution transforms as we increment by 1. We have

First key property of this transformation: it’s a convex combination for each value, i.e. it’s mixing. Mixing, in general, cannot decrease the entropy of a distribution, only increase it or leave it the same. So, the entropy of will not decrease with .

When will the entropy stay the same? Well, our transformation may perfectly conserve some quantities. Since the transformation is linear, those quantities should have the form for some , i.e. they’re expected values. They’re conserved when with probability 1.

Intuitively, we’d expect the entropy of everything except the conserved quantities to strictly increase. So, we’d expect the distribution to approach maximum entropy subject to constraints of the form , where with probability 1 (at least in the limit of large ). Thus, we have the maxent form

(Note on the in there: I’m actually maximizing relative entropy, relative to the prior on , which is almost always what one should actually do when maximizing entropy. That results in a term. We should find that is a conserved quantity anyway, so it shouldn’t actually matter whether we include the multiplier or not; we’ll get the same answer either way.)

Shortcomings of This Argument

Obviously it’s a bit handwavy. Other than that, the main issue is that the Telephone Theorem doesn’t really leverage the spatial distribution of information; information only propagates along a single dimension. As a result, there’s not really a way to talk about the conserved ’s being a sum over local terms, i.e. .

Despite the handwaviness, it’s an easy result to verify computationally for small systems, and I have checked that it works.

Resampling + gKPD Argument

Another approach is to start from the redundancy + resampling formulation of abstractions. In this approach, we run an MCMC process on our causal model. Any information which is highly redundant in the system—i.e. the natural abstractions—is near-perfectly conserved under resampling a single variable at a time; other information is all wiped out. Call the initial (low-level) state of the MCMC process , and the final state . Then we have

… where is conserved by the resampling process with probability 1.

It turns out that factors over the same DAG as the underlying causal model:

If the conserved quantities are much lower-dimensional than itself, then we can apply the gKPD theorem: we have a factorization of , we have a low-dimensional summary statistic which summarizes all the info in relevant to , so the gKPD theorem says that the distribution must have the form

… where is a relatively-small set of “exceptional” indices, and is some fixed reference value of . This is slightly different from our intended form—there’s the exception terms, and we have rather than just . The latter problem is easily fixed by absorbing into (at the cost of possibly increasing the summary dimension by 1), so that’s not really an issue, but the exception terms are annoying. Absorbing and assuming (for convenience) no exception terms, we get the desired form:

Note that this is maxentropic subject to constraints of the form . Since the summary statistic is conserved by the resampling process, we must have , so the conservation equation is

Shortcomings of This Argument

Obviously there’s the exception terms. Other than that, the main issue with this argument is an issue with the resampling approach more generally: once we allow approximation, it’s not clear that the natural abstractions from the resampling formulation are the same natural abstractions which make the Telephone Theorem work. Both are independently useful: information dropping to zero at a distance is an easy property to leverage for planning/​inference, and knowing the quantities conserved by MCMC makes MCMC-based planning and inference much more scalable. And in the limit of perfect conservation and infinite “distance”, the two match. But it’s not clear whether they match under realistic approximations, and I don’t yet have efficient methods to compute the natural abstractions both ways in large systems in order to check.

That said, resampling + gKPD does give us basically the result we want, at least for redundancy/​resampling-based natural abstractions.