KPD is a weak obstruction

Summary

I’ll begin with the notion of sufficient statistics: why they matter for bounded agents, the 90-year-old obstruction one encounters, and why this isn’t so bad. Then I’ll present an alternative approach with proofs and exercises, following Probability Theory: The Logic of Science (Jaynes 2003) extensively.

In pt.1 I use a symmetry argument to show that KPD assumes too much. All the benefits of sufficient statistics can be realized without assuming any kind of independence.

In pt.2 I generalize Maxent to elicit a class of non-iid models exhibiting sufficient statistics that are compatible with pt.1 by construction. Taken together, this can be viewed in some sense as a representation theorem for finite exchangeability.

Pt. 0 -- preliminaries

Let’s start with a couple historical remarks on the notion of sufficient statistics, as introduced by RA Fisher (ca. 1920) in On the Mathematical Foundations of Theoretical Statistics.

That the statistic chosen should summarise the whole of the relevant information supplied by the sample. This may be called the Criterion of Sufficiency.

Not long after, three independent (and presumably identically distributed) authors, e.g. Koopman 1936, discovered an obstruction to sufficiency. Today this result is known as the Koopman-Pitman-Darmois (KPD) theorem.

According to the Pitman–Koopman–Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases. Intuitively, this states that nonexponential families of distributions on the real line require nonparametric statistics to fully capture the information in the data.

Less tersely, suppose $X_1, X_2, \ldots, X_n$ are independent identically distributed real random variables whose distribution is known to be in some family of probability distributions, parametrized by $\theta$, satisfying certain technical regularity conditions; then that family is an exponential family if and only if there is a $\mathbb{R}^k$-valued sufficient statistic $T(X_1, \ldots, X_n)$ whose number of scalar components $k$ does not increase as the sample size $n$ increases.

At first blush, this seems to place a considerable restriction on an agent’s ability to compress data for the purpose of inference. Our brains summarize sensory data extensively. Does that mean we throw away most of the relevant data, or that we have representations of exponential families tucked away somewhere in our brains, or else what precisely? Once we accept KPD, we are stuck with the exponential family. Fortunately, the exponential family is actually pretty great!

  • Uniquely determined by KPD

  • Uniquely determined by Maxent

  • Fast parallelizable inference w/ conjugate priors

  • Used nearly everywhere in physics

Yet I will argue that we can do better than KPD. We can extend Maxent. We can still do fast inference. Physics itself permits more general models. But first, I will motivate sufficiency as an essential ingredient for bounded agentic reasoning.

Why should an agent care about sufficiency?

A typical sufficient statistic will be the sample mean and variance. For $n$ data points storage is $O(n)$, whereas for sample mean and variance storage is $O(1)$. This is a dramatic difference for large datasets. For the purpose of this document I will call this discrepancy the agent compression problem. When reasoning about the environment, a bounded agent with access to lots of sensory data will greatly benefit from finding sufficient statistics. Almost every parametric model you can name uses a sufficient statistic. Consider the task of measuring a few thermodynamic variables like temperature, pressure, and chemical potential vs. trying to track the individual motions of molecules. Lest I’ve undersold the importance and ubiquity of this idea, though it’s rarely (never?) described this way, the Standard Model is essentially a catalog of sufficient statistics for elementary particle interactions, i.e. sufficient statistics that we’ve experimentally uncovered and/or proposed on the basis of symmetry (or symmetry-breaking) arguments. From scattering cross-sections to thermodynamic equations of state, physicists have spent a century and a half developing fine-grained microscopic models from coarse-grained macroscopic measurements.
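As a minimal sketch of that storage gap (toy Python; the class and variable names are my own), an agent can fold each observation into a fixed-size summary and discard the raw value:

```python
class RunningGaussianSummary:
    """O(1)-memory sufficient statistics (count, sum, sum of squares) for a Gaussian model."""

    def __init__(self):
        self.n = 0      # number of observations seen
        self.s1 = 0.0   # running sum of x
        self.s2 = 0.0   # running sum of x**2

    def update(self, x: float) -> None:
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def mean(self) -> float:
        return self.s1 / self.n

    def variance(self) -> float:
        m = self.mean()
        return self.s2 / self.n - m * m


# Stream through arbitrarily many data points while storing only three numbers.
summary = RunningGaussianSummary()
for x in [2.0, 3.5, 1.2, 4.8]:
    summary.update(x)
print(summary.mean(), summary.variance())
```

(In practice one would use Welford's update for numerical stability; the point here is only that the memory footprint does not grow with the data.)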

What is sufficiency, really?

Jaynes (ch. 8) introduces the notion of sufficiency, with the following observations:

  • Properly an invariance of the posterior: “...given [the sufficient statistic] alone we should be able to draw the same inferences about [a model parameter] …”

  • Orthogonal to assumptions about sampling characteristics: “Note that we are not assuming independent or exchangeable sampling here...”

  • Produces a kernel condition on the prior (Jaynes 8.5). Fisher sufficiency is a special case of vanishing kernel, because it has no notion of a prior.
    “...use of a particular prior may make certain particular aspects of the data irrelevant. Then a different prior may make different aspects of the data irrelevant. One who is not prepared for this may think that a contradiction or paradox has been found.”

  • Only vanishing kernel implies Fisher-Neyman factorization (Jaynes 8.9), which is the starting point of most proofs of KPD.

Provided vanishing kernel, for a statistic $T(D)$ sufficient for a parameter $\theta$, the conditional distribution satisfies Fisher-Neyman (F-N) factorization:

$$p(D \mid \theta) = h(D)\, g\big(T(D), \theta\big)$$
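For a concrete instance (the familiar iid Bernoulli case, used here only as an illustration):

$$p(d_1, \ldots, d_n \mid \theta) = \prod_{i=1}^{n} \theta^{\,d_i}(1-\theta)^{1-d_i} = \underbrace{1}_{h(D)} \cdot \underbrace{\theta^{\,T(D)}(1-\theta)^{\,n - T(D)}}_{g(T(D),\,\theta)}, \qquad T(D) = \sum_{i=1}^{n} d_i$$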

Yet KPD makes restrictive assumptions about sampling characteristics and makes no use of the prior or posterior. Based on the above caveats, we should expect KPD to fail to capture the essence of sufficiency. I suspect KPD is a poor fit because it does not approach the agent compression problem directly—let’s explore that!

Pt. I—sufficiency from symmetry

Let’s approach agent compression directly by describing an “inference-only” agent, i.e. one that is not intervening on its inputs, that is considering the following propositions:


D = “receive data—from some operationalized process—in the form of elementary outcomes $d_1, d_2, \ldots, d_n$”
E = “likewise, elementary outcomes $e_1, e_2, \ldots, e_m$”
Note: D vs E is a logical boundary, not necessarily a causal or temporal boundary, e.g.

  • D corresponds to the first run of an experiment, E corresponds to the second run of the same experiment

  • D corresponds to the kth flips of a coin with k prime, E corresponds to the kth flips of a coin with k composite

  • D corresponds to a simple microscopic model of the magnetic dipoles of a particular cube of pure nickel in thermal equilibrium with its environment at some particular temperature, within some particular magnetic field, etc. E corresponds to the same degrees of freedom in a different cube of pure nickel under the same conditions

A is some model proposition, e.g.

  • $A_f$: “in spite of any other evidence, the long-run frequency of Heads under these experimental settings is $f$”

  • $A_p$: “in spite of any other evidence, the probability of Heads under these experimental settings is $p$”

  • $A_{\text{cat}}$: “this sensory input contains a cat”

T(D) = “the output of program T acting on input D”

Define Bayes sufficiency, i.e. “T(D) is Bayes sufficient for A given D”:

$$p\big(A \mid T(D)\big) = p(A \mid D)$$

Note that this does not require A to be a differentiable parameter (or even a parameter at all). The posterior is itself Bayes sufficient. To wit, taking $T(D) = p(A \mid D)$,

$$p\big(A \mid T(D)\big) = p\big(A \mid p(A \mid D)\big) = p(A \mid D)$$

This last equality is evident in Jaynes’ exploration of sufficiency, wherein he alludes to the agent compression problem.

...the only property of E which the robot needs in order to reason out the effect of new information is this density $(A_p \mid E)$. Everything that the robot has ever learned which is relevant to proposition A may consist of millions of isolated separate facts. But when it receives new information, it does not have to go back and search its entire memory for every little detail of its information relevant to A. Everything it needs in order to reason about A from that past experience is contained summarized in this one function, $(A_p \mid E)$. (Jaynes p. 557)

For sufficiency to hold for any new evidence E, it must follow that:

$$p\big(A \mid T(D), E\big) = p(A \mid D, E) = \frac{p(E \mid A, D)\; p(A \mid D)}{p(E \mid D)}$$

It should be clear from this expression that—without further assumptions—an agent needs to retain D to compute $p(E \mid A, D)$, $p(E \mid D)$, and so on. Sufficiency alone does not solve the compression problem.

Whence KPD?

One way to view KPD is that it makes progress by implicitly assuming a symmetry called infinite exchangeability. To build up to that, first consider the following symmetry under the permutation group—here denoted $S_n$—otherwise known as finite exchangeability:

$$p(d_1, d_2, \ldots, d_n \mid A) = p(d_{\sigma(1)}, d_{\sigma(2)}, \ldots, d_{\sigma(n)} \mid A) \quad \text{for all } \sigma \in S_n$$

Exercise: show that this relation implies the $d_i$ are identically distributed.

Infinite exchangeability is the extension of finite exchangeability to any finite symmetric subgroup $S_n \subset S_\infty$, in the limit $n \to \infty$. Representation theorems, due to de Finetti 1937 and others, establish conditions in which infinite exchangeability is equivalent to a mixture of iid. For a survey of various forms of exchangeability, see Diaconis 1988.

Together with F-N factorization—modulo proof-specific technical assumptions—infinite exchangeability entails KPD. There are many proofs. What they appear to have in common is a step where the independence assumption is used with a logarithm to break a product of independent probabilities into a sum. Then the identical distribution property is used to argue that the statistic can also be broken into a sum. Matching these expressions yields the exponential family. But one can represent nearly any probability distribution as a sum, by application of Kolmogorov-Arnold. The logarithm is perfectly suited only to independent distributions. This should give us some intuition that KPD’s independence assumption is unnecessarily restrictive.
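Schematically (a sketch of that shared step, not a reproduction of any particular proof; $h_0$, $\eta$, and $t$ are generic placeholders), iid sampling plus F-N gives

$$\log p(D \mid \theta) = \sum_{i=1}^{n} \log p(d_i \mid \theta) = \log h(D) + \log g\big(T(D), \theta\big),$$

and matching a sum of $n$ per-observation terms against a statistic of fixed dimension forces each term into the form $\log p(d_i \mid \theta) = \log h_0(d_i) + \eta(\theta) \cdot t(d_i)$, i.e. an exponential family.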

Time for T

I will make a slightly different assumption, informed by Bayes sufficiency—one that I think is more natural as it involves the sufficient statistic directly:

$$p(A \mid D) = p\big(A \mid T(D)\big), \qquad T(d_{\sigma(1)}, \ldots, d_{\sigma(n)}) = T(d_1, \ldots, d_n) \quad \text{for all } \sigma \in S_n$$

A pedagogical note. This assumption differs from exchangeability in that inference on A, given D—rather than the plausibility of D itself—is invariant to $\sigma \in S_n$:

$$p(A \mid d_{\sigma(1)}, \ldots, d_{\sigma(n)}) = p\big(A \mid T(d_{\sigma(1)}, \ldots, d_{\sigma(n)})\big) = p\big(A \mid T(d_1, \ldots, d_n)\big) = p(A \mid d_1, \ldots, d_n)$$

Due to its permutation symmetry, T can be represented by a particularly nice class of functions. Namely, power sum symmetric polynomials (essentially sample moments in this context) which form a basis of universal approximators via the Stone-Weierstrass theorem. See the Appendix of Deep Sets for a detailed discussion. There the authors show that permutation symmetry permits an invertible embedding of D (n dim) into (n+1 dim). From there, a simple way to achieve compression is to assume that inference on A depends only on some subspace, say, polynomials up to power $m$ ($m \ll n$). A more principled approach to compression will be explored in pt.2. With those ingredients in mind:

$$p(A \mid D, E) = p\big(A \mid T(D), T(E)\big) = p\big(A \mid T(D) + T(E)\big) = p\big(A \mid T(D \cup E)\big)$$


Now, so long as inference for A is the only goal, the agent is able to discard D in favor of T(D). This leads to the same update power as conjugate priors for the exponential family, apparently without any restrictions on model family.

Note that in thermodynamic settings, the additivity $T(D \cup E) = T(D) + T(E)$ is called extensivity.

Low-effort example

Let’s try a simple illustration when T(D) are power sums:
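Here is a minimal numerical sketch (my own toy setup, not a prescribed one): summarize each batch by its power sums up to degree $m = 2$, merge summaries by addition, and recover the sample mean and variance without the raw data.

```python
import numpy as np

def power_sums(data: np.ndarray, m: int = 2) -> np.ndarray:
    """T(D) = (n, sum d_i, ..., sum d_i^m): permutation invariant and additive across batches."""
    return np.array([np.sum(data ** k) for k in range(m + 1)])

rng = np.random.default_rng(0)
D = rng.normal(loc=1.0, scale=2.0, size=1000)   # first batch of observations
E = rng.normal(loc=1.0, scale=2.0, size=500)    # a later batch

T_D, T_E = power_sums(D), power_sums(E)

# Extensivity: the summary of the pooled data equals the sum of the summaries.
assert np.allclose(T_D + T_E, power_sums(np.concatenate([D, E])))

# Inference that depends only on T(D) + T(E): e.g. method-of-moments mean and variance.
n, s1, s2 = T_D + T_E
mean = s1 / n
var = s2 / n - mean ** 2
print(mean, var)
```

Updating on the later batch E requires only T(D) + T(E), so D itself can be discarded.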

Next steps

That’s it. Start with Bayes sufficiency, use the representation theorem implied by permutation symmetry, compress, evade KPD. But what about Maxent? Doesn’t that also justify exponential families? Jaynes recasts sufficiency in these terms:

...if we measure information in terms of entropy, then zero information loss in going from the full data set D to a statistic r is equivalent to sufficiency of r… (footnote to Jaynes p. 247)

Evidently, sufficiency has something to do with information entropy. As Maxent does not make an iid assumption (see Jaynes eq. 14.58), it provides a more generic justification for exponential families. Let’s explore these ideas more in the next section.

Pt. II—constructive non-iid models

Much of this section reflects a wider effort I undertook many years ago to generalize Maxent. These results were developed independently from pt.1 and as such do not depend on exchangeability. Instead they will reflect the notion of sufficiency in the sense of information entropy. I will offer a construction of models that simply make use of exchangeability, for the purpose of illustrating the compatibility and practical applicability of these approaches to agent compression.

f-divergence basic properties

Beginning with two probability distributions—p and q—and a convex function f, with $f(1) = 0$, define the f-divergence:

$$D_f(p \,\|\, q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx$$

Discrete and path integral analogues may be formulated mutatis mutandis. Beware that there are conflicting conventions in the literature—sometimes f-divergence is defined with opposite sign or with arguments reversed. When p and q are in the same parametric family, then through an abuse of notation one also writes:

$$D_f(\theta_1 \,\|\, \theta_2) \equiv D_f(p_{\theta_1} \,\|\, p_{\theta_2})$$

Following section 3.10 of Nielsen 2018, the f-divergence shares key desiderata with Kullback–Leibler divergence (KL). Namely Jensen’s Inequality, the Data Processing Inequality, invariance under 1:1 transformations, etc.

Exercise: prove 1 or 2 of the above properties.

For well-behaved convex functions, it’s useful to define two involutions (the convex conjugate $f \mapsto f^{*}$ and the reversal $f \mapsto f^{\diamond}$):

$$f^{*}(y) = \sup_{u}\big\{u\,y - f(u)\big\}, \qquad f^{\diamond}(u) = u\, f\!\left(\frac{1}{u}\right)$$

Exercise: show that these operations are involutions on the space of convex functions.

Without loss of generality and because it makes some of the math simpler, I tend to use so-called “standard” f-divergences, satisfying $f(1) = 0$, $f'(1) = 0$, and $f''(1) = 1$ (note for the last relation there’s a typo in Nielsen):

Exercise: prove the well-known identity for convex conjugates (for differentiable, strictly convex f):

$$(f^{*})' = (f')^{-1}$$

Exercise: show that:

Observe that KL is a special case of f-divergence, ordinarily defined with $f(u) = u \log u$, so that $D_f(p \,\|\, q) = \int p(x) \log\frac{p(x)}{q(x)}\, dx = \mathrm{KL}(p \,\|\, q)$.

Exercise: show that KL in “standard” form is given by the generator $f(u) = u \log u - u + 1$.
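A quick numerical check (the function and variable names below are mine) that both the ordinary generator $u\log u$ and the “standard” generator $u\log u - u + 1$ reproduce KL on discrete distributions:

```python
import numpy as np

def f_divergence(p: np.ndarray, q: np.ndarray, f) -> float:
    """D_f(p || q) = sum_x q(x) f(p(x)/q(x)), for strictly positive discrete p and q."""
    u = p / q
    return float(np.sum(q * f(u)))

f_kl = lambda u: u * np.log(u)               # ordinary KL generator
f_kl_std = lambda u: u * np.log(u) - u + 1   # "standard" form: f(1)=0, f'(1)=0, f''(1)=1

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

kl = float(np.sum(p * np.log(p / q)))
print(f_divergence(p, q, f_kl), f_divergence(p, q, f_kl_std), kl)
# All three agree: the extra (-u + 1) term contributes sum(q - p) = 0 for normalized p, q.
```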

Lagrangians are neat!

Here is a quick review of the Maxent solution (Jaynes 11.6) via the method of Lagrange multipliers (recall that maximizing $H(p) = -\sum_i p_i \log p_i$ is equivalent to minimizing KL against a uniform reference measure). Maximizing H subject to $\sum_i p_i = 1$ and $\sum_i p_i\, T_k(x_i) = \langle T_k \rangle$ yields:

$$p_i = \frac{1}{Z(\lambda)}\, \exp\!\Big(-\sum_k \lambda_k\, T_k(x_i)\Big), \qquad Z(\lambda) = \sum_i \exp\!\Big(-\sum_k \lambda_k\, T_k(x_i)\Big)$$

Exercise: convince yourself that this is an exponential family in the so-called “canonical form”.

This result is what I mean when I say Maxent justifies the exponential family. Exponential families are the extrema of KL under linear constraints. The extrema of f-divergences share key properties with the extrema of KL, in particular the F-N factorized form. I’ll show this now.

Beware I’m skipping some details, such as using KKT conditions to restrict the feasible set to $p_i \geq 0$, which you got “for free” in the exponential family solution. Nevertheless, following the same method as above, but now minimizing an f-divergence subject to the same constraints, the stationarity condition reads:

$$f'\!\left(\frac{p_i}{q_i}\right) = \lambda_0 - \sum_k \lambda_k\, T_k(x_i)$$

Exercise: show that because $(f^{*})' = (f')^{-1}$ and $f'$ is invertible for strictly convex f, we can write:

$$p_i = q_i\, (f^{*})'\!\Big(\lambda_0 - \sum_k \lambda_k\, T_k(x_i)\Big)$$

By inspection, this is in the form of F-N, with $h(x_i) = q(x_i)$ and $g\big(T(x_i), \lambda\big) = (f^{*})'\big(\lambda_0 - \sum_k \lambda_k T_k(x_i)\big)$, so it automatically reflects Bayes sufficiency:

$$p(\lambda \mid x) \propto p(\lambda)\, g\big(T(x), \lambda\big)$$

Exercise: show this.

Remarks on $T$

Now that we have a solution to our Lagrangian, we can return to the question of compression. Start with any fixed-size representation of T. Standard arguments from convex optimization show how a redundant statistic will become a slack condition, eliminating that constraint automatically through the vanishing of its Lagrange multiplier ($\lambda_k = 0$) -- see Jaynes pp. 369-370. In physics, this phenomenon underlies universality. Although a lifetime’s worth of complications arise in continuum models, techniques like the renormalization group (RG) inform particular procedures that enable one to argue that all but finitely many terms vanish. On the other hand, unaccounted-for statistics will cause recognizable errors in reasoning, which has historically led to new postulates such as Planck’s energy quantization—to explain blackbody radiation—or the Anderson-Higgs mechanism.
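In the KL case this can be read off from the partition function above (a one-line sketch): if the measured value of $\langle T_m \rangle$ already matches the prediction of the model built without that constraint, the optimum sits at $\lambda_m = 0$ and the constraint drops out.

$$\langle T_m \rangle_{\text{data}} = -\left.\frac{\partial \log Z(\lambda)}{\partial \lambda_m}\right|_{\lambda_m = 0} \quad \Longrightarrow \quad \lambda_m^{*} = 0$$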

When T is permutation invariant, the resulting extremum models the $d_i$ as identically distributed, but not independent. (It’s possible to relax the identical distribution property by assuming a weaker symmetry known as partial exchangeability. Jaynes eq. 14.58 exhibits the essential idea.) Independence holds exclusively in the KL case. But KL never gives power laws or heavy tails, which are ubiquitous in reality. An f-divergence readily produces power laws.

The f-divergence based on the Amari $\alpha$-divergence has extrema that are power laws:

$$p_i = q_i \left[\,1 + (1-q)\Big(\lambda_0 - \sum_k \lambda_k\, T_k(x_i)\Big)\right]_{+}^{1/(1-q)}$$

This last expression is sometimes called the q-exponential ($e_q$). $\alpha$-divergences lead to useful applications and insights in information geometry and machine learning, e.g. they provide a geometric interpretation of the expectation-maximization (E-M) algorithm.
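A quick numerical sanity check (assuming the usual Tsallis convention $e_q(y) = [1 + (1-q)\,y]_+^{1/(1-q)}$): the q-exponential has power-law tails for $q \neq 1$ and collapses back to the ordinary exponential, i.e. the KL/exponential-family case, as $q \to 1$.

```python
import numpy as np

def q_exp(y: np.ndarray, q: float) -> np.ndarray:
    """Tsallis q-exponential; reduces to exp(y) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.exp(y)
    base = 1.0 + (1.0 - q) * y
    out = np.zeros_like(y, dtype=float)
    mask = base > 0
    out[mask] = base[mask] ** (1.0 / (1.0 - q))
    return out

y = np.linspace(-1.0, 2.0, 5)
for q in (0.5, 0.9, 0.999, 1.0):
    print(q, np.round(q_exp(y, q), 4))
# The rows converge to exp(y) as q -> 1; away from q = 1 the decay is polynomial, not exponential.
```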

Exercise: show that

Exercise: show that

Here’s a fun example that demonstrates an f-divergence of purely combinatoric origin. If you follow the Wallis derivation (Jaynes 11.4) of entropy, but instead of classical quanta you perform the thought experiment with bosons, then, as I recall, you end up with a non-standard f-divergence with extrema mimicking Bose-Einstein statistics:

Exercise: show that the “standardized” form satisfies:

Same macro, different micro

Let’s see f-divergence in action. The following thought experiment is optional—you can skip ahead to contemplating bit strings if you like. Imagine that you have a big rock primarily composed of two isotopes of carbon, carbon-12 and carbon-14, in roughly 1:1 proportion. You also have a black box. When you put a chunk of your rock in the box, it ejects some sludge and then prints out a sheet of graphene: perfectly rectangular, one atom thick, and without defects. You look at it under a microscope and see a hexagonal lattice (side length $a$) of carbon atoms, but your microscope can’t distinguish the carbon-12 from the carbon-14. Because of this—and because it might take months to scan your microscope over the entire sheet, counting hexagons—you decide to make some macroscopic measurements. First you measure the area of the sheet, which (together with $a$) lets you calculate the number of carbon atoms $N$. Second you measure the mass of the sheet $M$, which will help you calculate the proportion of each isotope. Let $\nu$ be the proportion of carbon-12, then $M = N\big(\nu\, m_{12} + (1-\nu)\, m_{14}\big)$. Suppose you look up the masses $m_{12}$ and $m_{14}$ of each isotope in a table and, based on your measurements, use those values to calculate unexpectedly that $\nu \neq 1/2$. After further testing—smaller chunks, larger chunks, cutting, balancing, electrifying, and so on—you convince yourself based on the apparent translation symmetry of the lattice that each unit cell should be statistically identical. So if you let $n$ represent the number of carbon atoms per unit cell, and $k$ represent the number of carbon-12 atoms per unit cell, you can picture your graphene sheet as an ensemble of $N/n$ hexagons with $k \in \{0, 1, \ldots, n\}$. Now you are curious, but your apparatus haven’t told you: what is the isotopic composition of the unit cells produced by the black box? This motivates the following model:

Imagine that you have access to a bit string:

$$b = (b_1, b_2, \ldots, b_n) \in \{0,1\}^n, \qquad b_i = 1 \text{ if atom } i \text{ of the unit cell is carbon-12}$$

With the property:

$$\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n} b_i\right] = \nu$$

For a finite state space one can always make the choice of uniform reference measure:

$$q(b) = \frac{1}{2^n} \quad \text{for all } b \in \{0,1\}^n$$

There is a physical intuition that this corresponds to: provided that whatever process is happening in the box is unable to differentiate the carbon isotopes when forming graphene, the proportion of isotopes in the output will be the same as it was in the input, 1:1. That this has been experimentally disconfirmed is why there is any non-trivial inference left to do.

Exercise: show that if the experiment had instead confirmed $\nu = 1/2$, all of the models would be trivial, i.e. starting with any f-divergence:

$$\operatorname*{arg\,min}_{p}\; D_f(p \,\|\, q) = q$$

Let $k = \sum_{i=1}^{n} b_i$. Canonically, the binomial distribution is expressed in terms of the pushforward measure $p(k)$ rather than $p(b_1, \ldots, b_n)$. In other words, it’s treated as a distribution acting on $\{0, 1, \ldots, n\}$ rather than a distribution acting on $\{0,1\}^n$. This treatment is taken so much for granted that you’ll typically see the function $\binom{n}{k}\,\theta^k (1-\theta)^{n-k}$ presented as the definition of the binomial distribution. That definition is only sensible in the first place because $k$ is a sufficient statistic… smuggling in exchangeability after all. We will adopt that convention here:

$$q(k) = \binom{n}{k}\frac{1}{2^n}, \qquad \sum_{k=0}^{n} p(k) = 1, \qquad \sum_{k=0}^{n} k\, p(k) = \nu\, n$$

From here it is easy to proceed. Pick an f. Use your favorite numerical solver to find the multipliers $\lambda$ (equivalently, $p(k)$ itself). That’s your model.
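A hedged sketch of that numerical step (the specific values n = 6 and ν = 0.7 are placeholders of mine, scipy’s generic SLSQP routine stands in for “your favorite solver”, and the Pearson chi-squared generator is just one alternative choice of f): minimize $D_f(p\,\|\,q)$ over $p(k)$ subject to normalization and the isotope-fraction constraint.

```python
import numpy as np
from math import comb
from scipy.optimize import minimize

n, nu = 6, 0.7          # atoms per unit cell and measured carbon-12 fraction (both placeholders)
k = np.arange(n + 1)
q = np.array([comb(n, j) for j in k]) / 2.0**n   # pushforward of the uniform reference measure

def D_f(p, f):
    """Discrete f-divergence: sum_k q_k f(p_k / q_k)."""
    return float(np.sum(q * f(p / q)))

f_kl = lambda u: u * np.log(np.clip(u, 1e-300, None)) - u + 1   # standard-form KL generator
f_chi2 = lambda u: 0.5 * (u - 1.0) ** 2                         # Pearson chi^2 generator

def solve(f):
    cons = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0},          # normalization
            {"type": "eq", "fun": lambda p: np.sum(p * k) - nu * n}]   # <k> = nu * n
    res = minimize(lambda p: D_f(p, f), x0=q, method="SLSQP",
                   bounds=[(1e-12, 1.0)] * (n + 1), constraints=cons)
    return res.x

for name, f in [("KL", f_kl), ("chi^2", f_chi2)]:
    p = solve(f)
    print(name, np.round(p, 4), "  <k>/n =", round(float(np.sum(p * k)) / n, 3))
# The KL solution recovers the binomial (exponential-family) model; other generators satisfy the
# same macroscopic constraints but assign different microscopic probabilities p(k).
```

Printing $p(k)$ for the two generators makes the “same macro, different micro” point directly.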




Exercise: check for each model that, as expected:

$$\sum_{k=0}^{n} p(k) = 1 \qquad \text{and} \qquad \frac{1}{n}\sum_{k=0}^{n} k\, p(k) = \nu$$

Now the question of whether any of these models is correct is in principle testable. Build (or apply for beamtime on) a device that can distinguish the isotopes in a unit cell. For example, neutron scattering might be a good technique for this. Once you can distinguish the isotopic configurations with your measurements, you can distinguish your models, because even though they all match your macroscopic measurements, they make remarkably different microscopic predictions $p(k)$. For comparison, here are the values for two of the models above:

Because the extrema of f-divergences are not characterized by independence, they should have novel application to source coding, large deviations theory, Solomonoff induction, in fact anywhere that Shannon entropy or KL already appears… just to name a few ideas. A practical roadblock is that the above numerical results for inference are terribly slow to compute, not to mention unsatisfying. I will address this in a future post. (Hint: the answer is something like Feynman diagrams, but worse!)
