Conditional On Long-Range Signal, Ising Still Factors Locally

Background: The Ising Model

The Ising Model is a classic toy model of magnets. We imagine a big 2D or 3D grid, representing a crystal lattice. At each grid vertex $i$, there's a little magnetic atom with state $\sigma_i$, which can point either up ($\sigma_i = +1$) or down ($\sigma_i = -1$). When two adjacent atoms point the same direction, their joint energy is lower than when they point different directions; atoms further apart don't directly interact. So we write the energy function (aka Hamiltonian) as $H(\sigma) = -J \sum_{(i,j)\ \mathrm{adjacent}} \sigma_i \sigma_j$ for some constant $J > 0$; two adjacent atoms with the same direction contribute energy $-J$, while two adjacent atoms with different directions contribute energy $+J$.

Read: Blue → −1 or “cold” or “down”; Red → +1 or “hot” or “up”.
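To pin down the energy function in code, here is a minimal Python sketch (the 2D array representation, free boundary conditions, and function name are our illustrative choices, not from the original):

```python
import numpy as np

def ising_energy(sigma: np.ndarray, J: float = 1.0) -> float:
    """Hamiltonian H(sigma) = -J * sum of sigma_i * sigma_j over adjacent pairs.

    sigma: 2D array of +1/-1 spins, with free (non-periodic) boundaries.
    """
    horizontal = np.sum(sigma[:, :-1] * sigma[:, 1:])  # horizontally adjacent pairs
    vertical = np.sum(sigma[:-1, :] * sigma[1:, :])    # vertically adjacent pairs
    return -J * (horizontal + vertical)
```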

With the energy function in hand, the Boltzmann distribution $P(\sigma) = \frac{1}{Z} e^{-H(\sigma)/T}$ specifies the distribution of states for temperature $T$. Crucially, this distribution factors locally: we can write it as a product of terms which each depend on only two adjacent atom-states, i.e. $P(\sigma) = \frac{1}{Z} \prod_{(i,j)\ \mathrm{adjacent}} e^{J \sigma_i \sigma_j / T}$. In the terminology of graphical models, Ising is an undirected graphical model, aka a Markov random field; it “factors over” a graph matching the grid. That factorization is key for representing the distribution efficiently, and unlocks lots of standard algorithms and techniques for working with the distribution.

Each magnetic atom is represented by a node $\sigma_i$, and is adjacent to its 4 immediate neighbors.
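As a quick numerical check of that factorization (reusing the illustrative `ising_energy` sketch above), the unnormalized Boltzmann weight $e^{-H(\sigma)/T}$ should equal the product of one local factor per adjacent pair:

```python
import numpy as np

def pairwise_factor_product(sigma: np.ndarray, J: float = 1.0, T: float = 1.0) -> float:
    """Product of local factors e^{J*sigma_i*sigma_j/T}, one per adjacent pair."""
    horizontal = np.exp(J * sigma[:, :-1] * sigma[:, 1:] / T)
    vertical = np.exp(J * sigma[:-1, :] * sigma[1:, :] / T)
    return horizontal.prod() * vertical.prod()

rng = np.random.default_rng(0)
sigma = rng.choice([-1, 1], size=(4, 4))
# Unnormalized Boltzmann weight e^{-H/T} matches the product of pairwise factors:
assert np.isclose(np.exp(-ising_energy(sigma) / 1.0), pairwise_factor_product(sigma))
```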

At low temperatures, most of the atoms will lock in to point the same direction. That’s a long-range signal: observing the consensus direction in one little chunk of the grid gives us a pretty confident estimate of the consensus direction in another chunk far away. Other than that signal, far-apart chunks are independent.
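One way to see that lock-in concretely is to run a standard single-flip Metropolis sampler at low temperature and compare the mean spin in two far-apart corners. A rough sketch (grid size, temperature, and step count are arbitrary illustrative choices; this is the textbook Metropolis algorithm, nothing specific to this post):

```python
import numpy as np

def metropolis_step(sigma: np.ndarray, J: float, T: float, rng: np.random.Generator) -> None:
    """Propose flipping one random spin; accept with probability min(1, e^{-dE/T})."""
    n, m = sigma.shape
    i, j = rng.integers(n), rng.integers(m)
    # Flipping spin (i, j) changes the energy by dE = 2*J*sigma_ij*(sum of its neighbors).
    nbrs = sum(sigma[x, y] for x, y in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
               if 0 <= x < n and 0 <= y < m)
    dE = 2 * J * sigma[i, j] * nbrs
    if dE <= 0 or rng.random() < np.exp(-dE / T):
        sigma[i, j] = -sigma[i, j]

rng = np.random.default_rng(0)
sigma = rng.choice([-1, 1], size=(16, 16))
for _ in range(500_000):  # T = 1.0 is well below the 2D critical temperature ~2.27*J
    metropolis_step(sigma, J=1.0, T=1.0, rng=rng)
# Given enough steps, far-apart corners typically agree on a consensus direction:
print(sigma[:4, :4].mean(), sigma[-4:, -4:].mean())
```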

That’s all standard material; now for the new and interesting claim.

Hot Claim: Conditional On Long-Range Signal At Low Temperature, Ising Approximately Factors Over The Same Graph

This is cool because it means we can condition on that long-range signal—i.e. the consensus direction—and still use the graphical factorization for efficient representation, standard algorithms, etc.

In the particular case of Ising, there’s also a neat physical interpretation: conditioning on long-range signal is approximately equivalent to a weak external magnetic field. We won’t focus much on that interpretation because we’re not that interested in magnets per se; we’re mainly interested in generalization to other graphical models.
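To spell that interpretation out (anticipating the exponentially-tilted form $\frac{1}{Z'} P(\sigma) e^{\sum_i \sigma_i/\lambda}$ derived in the proof sketch below): one line of algebra gives

$$P(\sigma)\, e^{\sum_i \sigma_i/\lambda} \propto \exp\left(-\frac{1}{T}\left(H(\sigma) - \frac{T}{\lambda}\sum_i \sigma_i\right)\right)$$

and $H(\sigma) - h\sum_i \sigma_i$ with $h = T/\lambda$ is exactly the Ising Hamiltonian with a uniform external field of strength $h$. Since $\lambda$ will turn out to be large, the field is weak.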

Proof Sketch

First Piece: Using Maxent To Pick Out A Mode (In General)

Conceptually, at low temperatures, the distribution of states $\sigma$ is very bimodal: there’s one mode where most atoms point down, and one where most point up, and these two modes are very well separated.

Conditioning on the long-range signal means picking out one of the two modes—e.g. if we condition on the consensus direction being up (+1), then our posterior distribution will have all its weight on just one of the two modes.

So with that in the back of our minds, let’s first walk through a simpler 1-dimensional problem.

Let $X$ be a scalar, and let $P(X)$ be a 50/50 mixture of two unimodal distributions $P_1$ and $P_2$, with very well separated modes and very little overlap, e.g.:

A 50/50 mixture of two well-separated unimodal distributions.

Now, let’s think about the maxent problem

maxent $Q(X)$ relative to $P(X)$ subject to $E_Q[X] = \mu$

… where we choose $\mu = \mu_2 := E_{P_2}[X]$, i.e. the mean must match the mean of the second of the two mixture components.

By the usual maxent machinery, the solution will have the form

$$Q(X) = \frac{1}{Z} P(X)\, e^{X/\lambda}$$

for some scalar $\lambda$ and normalizer $Z$. (We’re using the inverse $\lambda$ of the usual Lagrange multiplier here because it’s easier to interpret.)
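For completeness, here is a sketch of that machinery (the standard Lagrange-multiplier calculation; the notation $Q$, $\alpha$, $\beta$ is ours): we maximize entropy of $Q$ relative to $P$ subject to the mean and normalization constraints,

$$\mathcal{L}[Q] = -\int Q(X) \log\frac{Q(X)}{P(X)}\, dX + \alpha\left(\int X\, Q(X)\, dX - \mu\right) + \beta\left(\int Q(X)\, dX - 1\right)$$

Setting the functional derivative $-\log\frac{Q(X)}{P(X)} - 1 + \alpha X + \beta$ to zero gives $Q(X) \propto P(X) e^{\alpha X}$; writing $\alpha = 1/\lambda$ and normalizing yields the form above.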

Now, let’s consider two rough quantities associated with our bimodal mixture distribution $P$:

  • Distance between the two peaks, which we’ll call $\Delta$

  • Width of the peaks, which we’ll call $\delta$ (if one is much wider than the other, then take $\delta$ to be the wider width)

By assumption, the two distributions are well separated, i.e. $\Delta \gg \delta$.

To get an approximate solution to our maxent problem, pick some $\lambda$ at a middle scale between $\delta$ and $\Delta$, i.e. $\delta \ll \lambda \ll \Delta$. Then:

  • Within one peak centered at $\mu_k$, $X$ varies roughly between $\mu_k - \delta$ and $\mu_k + \delta$, so $X/\lambda$ varies roughly between $(\mu_k - \delta)/\lambda$ and $(\mu_k + \delta)/\lambda$. But $\delta \ll \lambda$, so the entire range of variation of $X/\lambda$ across the peak is approximately 0; $e^{X/\lambda}$ is therefore approximately constant (especially on a log scale) across one peak.

  • By similar reasoning, between peaks $X/\lambda$ ranges by roughly $\Delta/\lambda$. But $\Delta \gg \lambda$, so the variation of $e^{X/\lambda}$ between peaks will be exponentially large; thus the exponential majority of the weight will be on just one peak.

Now put those two together:

  • $\lambda$ of the right order of magnitude (and sign) can put the exponential majority of weight on peak 2…

  • … and within peak 2, $e^{X/\lambda}$ will be approximately constant, so (with normalization) the distribution $\frac{1}{Z} P(X) e^{X/\lambda}$ approximately just matches $P_2$…

  • … and therefore the constraint $E[X] = \mu_2$ is approximately satisfied, i.e. $\frac{1}{Z} P(X) e^{X/\lambda}$ indeed approximately solves our maxent problem.

Takeaway: when the two peaks of $P$ are well separated (i.e. the distance between them is large relative to their widths), we can find a distribution of the form $\frac{1}{Z} P(X) e^{X/\lambda}$ which approximately matches the distribution of just one peak.
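That takeaway is easy to sanity-check numerically. A small sketch (the peak locations, widths, and choice of $\lambda$ are arbitrary illustrative values):

```python
import numpy as np

# 50/50 mixture of two Gaussians: peaks at +-50 (Delta = 100), widths delta = 1.
delta, Delta = 1.0, 100.0
x = np.linspace(-80, 80, 200_001)
p1 = np.exp(-0.5 * ((x + Delta / 2) / delta) ** 2)
p2 = np.exp(-0.5 * ((x - Delta / 2) / delta) ** 2)
p = 0.5 * p1 + 0.5 * p2

lam = 10.0  # middle scale: delta << lam << Delta
tilted = p * np.exp(x / lam)  # the maxent form (1/Z) P(X) e^{X/lam}, up to normalization
tilted /= tilted.sum()
p2 /= p2.sum()

# Both means land near Delta/2 = 50, and the tilted distribution is close to P2:
print((x * tilted).sum(), (x * p2).sum())
print(0.5 * np.abs(tilted - p2).sum())  # total variation distance (small)
```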

Second Piece: Using Maxent To Pick Out A Mode (Ising at Low Temp)

Going back to the Ising model, let’s consider the long-range signal $X = \sum_i \sigma_i$.

Within each peak, the consensus direction is effectively “given”, so far-apart chunks of the grid are approximately independent. So the central limit theorem will kick in, and $\sum_i \sigma_i$ will be approximately normal within each peak; the two separate peaks will thus be two separate normal distributions.

With $N$ atoms in the grid, the magnitude of the mean of $\sum_i \sigma_i$ will scale like $N$, with the two peaks having opposite signs. The standard deviation (i.e. width of each peak) will scale like $\sqrt{N}$. The peak separation $\Delta \sim N$ is much larger (for large $N$) than the peak width $\delta \sim \sqrt{N}$, so we can apply the trick from the previous section.

Upshot: as long as we choose $\lambda$ satisfying $\sqrt{N} \ll \lambda \ll N$, we will find that $\frac{1}{Z'} P(\sigma) e^{\sum_i \sigma_i/\lambda}$ approximately matches $P(\sigma \mid \text{consensus up})$. The $e^{\sum_i \sigma_i/\lambda}$ term will be approximately constant over the consensus-direction-up peak, and will differ exponentially between the two peaks, so that nearly all the weight is on the up-peak. (By taking $\lambda$ of the opposite sign, we can similarly make the tilted distribution match $P(\sigma \mid \text{consensus down})$.)
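Everything above is asymptotic in $N$, but a brute-force check is feasible on a tiny grid: exact enumeration over all $2^{16}$ states of a 4×4 grid (grid size, $T$, and $\lambda$ are illustrative choices, and a 4×4 grid is far from the large-$N$ regime, so expect only rough agreement):

```python
import itertools
import numpy as np

n, J, T = 4, 1.0, 1.0
lam = 8.0  # between sqrt(N) = 4 and N = 16

states = np.array(list(itertools.product([-1, 1], repeat=n * n)))  # 2^16 states
grids = states.reshape(-1, n, n)
energy = -J * ((grids[:, :, :-1] * grids[:, :, 1:]).sum(axis=(1, 2))
               + (grids[:, :-1, :] * grids[:, 1:, :]).sum(axis=(1, 2)))
boltz = np.exp(-energy / T)
boltz /= boltz.sum()

mag = states.sum(axis=1)  # the long-range signal X = sum_i sigma_i

# Condition on consensus-up: restrict to states with positive total magnetization.
cond = np.where(mag > 0, boltz, 0.0)
cond /= cond.sum()

# Exponential tilt: multiply by e^{X/lambda} and renormalize.
tilted = boltz * np.exp(mag / lam)
tilted /= tilted.sum()

# Total variation distance between the conditional and the tilted distribution:
print(0.5 * np.abs(cond - tilted).sum())
```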

Third Piece: Approximate Factorization Yay!

So we have $P(\sigma \mid \text{consensus up}) \approx \frac{1}{Z'} P(\sigma) e^{\sum_i \sigma_i/\lambda}$ (for some handwavy notion of approximation which we never actually defined). The last step is just to substitute in the factored form of $P(\sigma)$ and write everything in an explicit factored form:

$$P(\sigma \mid \text{consensus up}) \approx \frac{1}{Z''} \left( \prod_{(i,j)\ \mathrm{adjacent}} e^{J \sigma_i \sigma_j / T} \right) \left( \prod_i e^{\sigma_i/\lambda} \right)$$

… where $Z'' = Z Z'$ absorbs both normalizers.

… and there we have it: each factor depends on only one atom-state or two adjacent atom-states, so the conditional distribution still (approximately) factors over the same grid graph.
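As a final sketch, that factorization can be checked mechanically: the tilted weight is a product of one local factor per adjacent pair plus one per atom (continuing the illustrative Python setup from earlier; the $J$, $T$, $\lambda$ values are arbitrary):

```python
import numpy as np

def tilted_factor_product(sigma: np.ndarray, J: float = 1.0, T: float = 1.0, lam: float = 8.0) -> float:
    """Product of local factors: e^{J*s_i*s_j/T} per adjacent pair, e^{s_i/lam} per atom."""
    pair_h = np.exp(J * sigma[:, :-1] * sigma[:, 1:] / T)
    pair_v = np.exp(J * sigma[:-1, :] * sigma[1:, :] / T)
    unary = np.exp(sigma / lam)
    return pair_h.prod() * pair_v.prod() * unary.prod()

rng = np.random.default_rng(1)
sigma = rng.choice([-1, 1], size=(4, 4))
H = -1.0 * ((sigma[:, :-1] * sigma[:, 1:]).sum() + (sigma[:-1, :] * sigma[1:, :]).sum())
# Matches the tilted Boltzmann weight e^{-H/T} * e^{(sum_i sigma_i)/lam}:
assert np.isclose(tilted_factor_product(sigma), np.exp(-H / 1.0) * np.exp(sigma.sum() / 8.0))
```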

Why Is This Interesting?

A central subproblem of natural abstraction is, roughly, how to handle the low-level distribution conditional on the high-level signal.

The prototypical mental picture is very roughly:

  • Agent gets some raw data, and hooks it into the world model at a low level to be updated-upon.

  • That data is from a spacetime-localized chunk of the world, because sensory organs are spacetime-localized.

  • So, for purposes of propagating the update to most of the rest of the world model, only a relatively small “abstract” summary is relevant; that triggers an update of the high level, which tracks longer-range signals.

  • Later on, when making localized low level predictions someplace else, all the relevant information from far-away observations is summarized by the high level.

… but in order for that mental picture to make any sense, the low level processing at the first and last step has to be local—i.e. it needs to not involve unpacking a low level model of the entire world.

Concretely, for an Ising model, the idea would be:

  • Agent observes a little chunk of a big Ising system.

  • That data is from a localized chunk.

  • For purposes of propagating the update to the rest of the world model (i.e. Ising system), only the consensus direction is relevant. That triggers an update of the high level, which tracks (the agent’s best guess of) the consensus direction.

  • Later on, when making predictions about some other chunk of the system, the agent uses its high-level estimate of the consensus direction, without thinking about the low level details of most of the rest of the system.

… but again, this only works insofar as the first and last step can work with a little chunk of the low level system without having to unpack the entire system at a low level. That requires some kind of locality, which is what the hand-wavy proof argues for.

Even with that locality established, we don’t yet have all the pieces to follow the conceptual story. But it does give us a foothold.

Of course, the real hope is to find extremely general techniques, which work far beyond Ising. And the hand-wavy proof above is reasonably general: it just requires that the long-range signal distribution has well-separated peaks, with peak spacing much larger than width. In other words, it’s a form of scale separation. That applies to an awful lot of systems… but importantly not all systems.

Insofar as the scale separation trick fails, what should we do instead? Well, presumably we need to leverage some kind of structure other than undirected graphical structure conditional on the long-range signal. Figuring out that structure is the real hot goal here, and to that end two major takeaways are:

  • it’s probably not just undirected graphical factorization, but...

  • it probably does reduce to undirected graphical factorization when the appropriate scale separation is present.