Resampling Conserves Redundancy (Approximately)

Suppose random variables $X_1$ and $X_2$ contain approximately the same information about a third random variable $\Lambda$, i.e. both of the following diagrams are satisfied to within approximation $\epsilon$:

[Diagram: the two redundancy diagrams, $X_1 \to X_2 \to \Lambda$ and $X_2 \to X_1 \to \Lambda$, each to within $\epsilon$. “Red” for redundancy.]

We call $\Lambda$ a “redund” over $X_1, X_2$, since conceptually, any information $\Lambda$ contains about $X = (X_1, X_2)$ must be redundantly represented in both $X_1$ and $X_2$ (to within approximation).

Here’s an intuitive claim which is surprisingly tricky to prove: suppose we construct a new variable $\Lambda'$ by resampling $\Lambda$ conditional on $X_1$, i.e. sampling $\Lambda'$ from $P[\Lambda | X_1]$, so the new joint distribution is

$$P[X_1, X_2, \Lambda'] = P[X_1, X_2]\ P[\Lambda = \Lambda' | X_1]$$

By construction, this “resampled” variable satisfies one of the two redundancy diagrams perfectly: $X_2 \to X_1 \to \Lambda'$. Intuitively, we might expect that $\Lambda'$ approximately satisfies the other redundancy diagram ($X_1 \to X_2 \to \Lambda'$) as well; conceptually, $\Lambda$ (approximately) only contains redundant information about $X$, so $X_1$ contains (approximately) the same information about $\Lambda$ as all of $X$ does, so the resampling operation should result in (approximately, in some sense) the same distribution we started with and therefore (approximately) the same properties.

In this post, we’ll prove that claim and give a bound for the approximation error.

Specifically:

Theorem: Resampling (Approximately) Conserves (Approximate) Redundancy

Let random variables $X_1, X_2, \Lambda$ satisfy the diagrams $X_1 \to X_2 \to \Lambda$ and $X_2 \to X_1 \to \Lambda$ to within $\epsilon$, i.e.

$$D_{KL}\big(P[X_1, X_2, \Lambda]\ \big|\big|\ P[X_1, X_2]\,P[\Lambda | X_2]\big) \le \epsilon$$
$$D_{KL}\big(P[X_1, X_2, \Lambda]\ \big|\big|\ P[X_1, X_2]\,P[\Lambda | X_1]\big) \le \epsilon$$

Also, assume $P[X_1, X_2, \Lambda] > 0$ for all values of $(X_1, X_2, \Lambda)$.

Construct $\Lambda'$ by sampling from $P[\Lambda | X_1]$, so $P[X_1, X_2, \Lambda'] = P[X_1, X_2]\,P[\Lambda = \Lambda' | X_1]$. Then $X_2 \to X_1 \to \Lambda'$ is perfectly satisfied by construction, and $X_1 \to X_2 \to \Lambda'$ is satisfied to within $9\epsilon$, i.e.

$$D_{KL}\big(P[X_1, X_2, \Lambda']\ \big|\big|\ P[X_1, X_2]\,P[\Lambda' | X_2]\big) \le 9\epsilon$$


In diagrammatic form:

(Where $O(\epsilon)$ in this version of the proof is $9\epsilon$, specifically.)
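As a quick sanity check on the claim (not part of the original argument; the 3x3x3 outcome spaces, variable names, and number of trials below are arbitrary choices of ours), here is a small numpy sketch which builds random joint distributions, measures the two redundancy errors, resamples $\Lambda'$ from $P[\Lambda|X_1]$, and confirms that the resampled diagram's error stays below $9\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """KL divergence in nats between distributions on the same support."""
    return float(np.sum(p * np.log(p / q)))

def diagram_errors(P):
    """P has shape (|X1|, |X2|, |Lambda|); returns (eps1, eps2, eps_resampled)."""
    P_x1x2 = P.sum(axis=2, keepdims=True)                                        # P[X1, X2]
    P_lam_x1 = P.sum(axis=1, keepdims=True) / P.sum(axis=(1, 2), keepdims=True)  # P[Lambda | X1]
    P_lam_x2 = P.sum(axis=0, keepdims=True) / P.sum(axis=(0, 2), keepdims=True)  # P[Lambda | X2]

    eps1 = kl(P, P_x1x2 * P_lam_x2)       # error of diagram X1 -> X2 -> Lambda
    eps2 = kl(P, P_x1x2 * P_lam_x1)       # error of diagram X2 -> X1 -> Lambda

    # Resample: P[X1, X2, Lambda'] = P[X1, X2] * P[Lambda | X1]
    P_res = P_x1x2 * P_lam_x1
    Pr_lam_x2 = P_res.sum(axis=0, keepdims=True) / P_res.sum(axis=(0, 2), keepdims=True)
    eps_res = kl(P_res, P_x1x2 * Pr_lam_x2)   # error of diagram X1 -> X2 -> Lambda'
    return eps1, eps2, eps_res

for _ in range(1000):
    P = rng.dirichlet(np.ones(27)).reshape(3, 3, 3)   # random strictly positive joint
    eps1, eps2, eps_res = diagram_errors(P)
    assert eps_res <= 9 * max(eps1, eps2) + 1e-9
print("resampled error stayed within 9*eps on all trials")
```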

Notation

We will use the shorthand $P[\Lambda | X_1]$ to mean the function $(\lambda, x_1) \mapsto P[\Lambda = \lambda | X_1 = x_1]$, and similarly for other distributions. For instance, $P[\Lambda | X_1, X_2] = P[\Lambda | X_1]$ is shorthand for $\forall \lambda, x_1, x_2: P[\Lambda = \lambda | X_1 = x_1, X_2 = x_2] = P[\Lambda = \lambda | X_1 = x_1]$, which is equivalent to the diagram $X_2 \to X_1 \to \Lambda$ holding exactly.

We will work with nats for mathematical convenience (i.e. all logarithms are natural logs).

Proof

The proof proceeds in three steps:

  • From $\Lambda$, we can construct a new variable $\bar\Lambda$ which is equal to $\Lambda$ with probability $\alpha$ and is otherwise a constant. Then the errors on all diagrams in the theorem for $\bar\Lambda$ are the same as the corresponding errors for $\Lambda$, but scaled down by a factor of $\alpha$. This allows us to focus only on the regime of arbitrarily small errors, and obtain a global bound from that regime.

  • Within the regime of arbitrarily small errors, $D_{KL}(P || Q)$ is approximately (to second order) equal to twice the squared Hellinger distance, $2 H^2(P, Q) = 2 \sum_x \left(\sqrt{P[x]} - \sqrt{Q[x]}\right)^2$. The Hellinger distance is a straightforward Euclidean distance metric (between the vectors $\sqrt{P}$ and $\sqrt{Q}$), so we can use standard geometric reasoning on it.

  • The Hellinger distances between $P[\Lambda|X_1]$ and $P[\Lambda|X_1,X_2]$, between $P[\Lambda|X_1,X_2]$ and $P[\Lambda|X_2]$, and between $P[\Lambda|X_2]$ and $P[\Lambda'|X_2]$ are all small, so end-to-end the Hellinger distance between $P[\Lambda'|X_1,X_2] = P[\Lambda|X_1]$ and $P[\Lambda'|X_2]$ is small. That yields the bound we’re after.

Step 1: Scaling Down The Errors

First we construct the new variable $\bar\Lambda$ as a stochastic function of $\Lambda$. Specifically, $\bar\Lambda = \Lambda$ with probability $\alpha$, else $\bar\Lambda$ is a constant $c$, where $c$ is outside the support of $\Lambda$ (so when we see $\bar\Lambda = c$, we gain no information about $X$).

A little algebra confirms that $\bar\Lambda$’s errors are simply $\Lambda$’s errors scaled down by $\alpha$: the new outcome $c$ contributes zero to each $D_{KL}$, while every other term picks up a factor of $\alpha$. For instance,

$$D_{KL}\big(P[X_1,X_2,\bar\Lambda]\ \big|\big|\ P[X_1,X_2]\,P[\bar\Lambda|X_1]\big) = \alpha\, D_{KL}\big(P[X_1,X_2,\Lambda]\ \big|\big|\ P[X_1,X_2]\,P[\Lambda|X_1]\big)$$

Similarly, constructing $\bar\Lambda'$ just like $\Lambda'$ (i.e. resampled from $P[\bar\Lambda | X_1]$) is equivalent to constructing $\bar\Lambda'$ as a stochastic function of $\Lambda'$, where $\bar\Lambda' = \Lambda'$ with probability $\alpha$, else $\bar\Lambda'$ is $c$. So, by the same algebra as above, the error on the resampled diagram for $\bar\Lambda'$ is $\alpha$ times the error on the resampled diagram for $\Lambda'$.
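Here is a quick numerical check of the scaling claim (a sketch of ours, not part of the proof; $\alpha = 0.25$ and the 3x3x3 outcome spaces are arbitrary): dilute $\Lambda$ as described above and confirm that both diagram errors scale by exactly $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(27)).reshape(3, 3, 3)   # P[X1, X2, Lambda]
alpha = 0.25

# Diluted latent: the last index is the new constant symbol outside Lambda's support.
P_bar = np.concatenate([alpha * P, (1 - alpha) * P.sum(axis=2, keepdims=True)], axis=2)

def eps_pair(P):
    kl = lambda p, q: float(np.sum(p * np.log(p / q)))
    P_x1x2 = P.sum(axis=2, keepdims=True)
    P_lam_x1 = P.sum(axis=1, keepdims=True) / P.sum(axis=(1, 2), keepdims=True)
    P_lam_x2 = P.sum(axis=0, keepdims=True) / P.sum(axis=(0, 2), keepdims=True)
    return (kl(P, P_x1x2 * P_lam_x2),   # error of X1 -> X2 -> Lambda
            kl(P, P_x1x2 * P_lam_x1))   # error of X2 -> X1 -> Lambda

eps1, eps2 = eps_pair(P)
eps1_bar, eps2_bar = eps_pair(P_bar)
assert np.allclose([eps1_bar / eps1, eps2_bar / eps2], alpha)
print("both diagram errors scaled by exactly alpha")
```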

The upshot: if there exists a distribution over variables $X_1, X_2, \Lambda$ for which

$$D_{KL}\big(P[X_1,X_2,\Lambda']\ \big|\big|\ P[X_1,X_2]\,P[\Lambda'|X_2]\big) > 9\epsilon$$

(where $\epsilon$ bounds both of the original diagrams’ errors for that distribution),

then there also exists a distribution satisfying the same inequality with all the $\epsilon$’s arbitrarily small[1]. Flipping that statement around: if there does not exist any distribution for which the $\epsilon$’s are all arbitrarily small and the inequality is satisfied, then there does not exist any distribution for which the inequality is satisfied.

In other words: if we can show

$$D_{KL}\big(P[X_1,X_2,\Lambda']\ \big|\big|\ P[X_1,X_2]\,P[\Lambda'|X_2]\big) \le 9\epsilon$$

in the regime where all the $\epsilon$’s are arbitrarily small, then the same inequality is also established globally, proving our theorem. The rest of the proof will therefore show

$$D_{KL}\big(P[X_1,X_2,\Lambda']\ \big|\big|\ P[X_1,X_2]\,P[\Lambda'|X_2]\big) \le 9\epsilon$$

in the regime where all the $\epsilon$’s are arbitrarily small. In particular, we’ll use a second order approximation for the $D_{KL}$’s.

Step 2: Second Order Approximation

Validity

Before we can use a second order approximation of the $D_{KL}$’s, we need to show that small $\epsilon$ implies that the second order approximation is valid.

For that purpose, we use the Hellinger-KL inequality:

$$D_{KL}(P\ ||\ Q) \ge -2 \ln\!\left(1 - \tfrac{1}{2} H^2(P, Q)\right)$$

where $H^2(P, Q) = \sum_x \left(\sqrt{P[x]} - \sqrt{Q[x]}\right)^2$ is the squared Hellinger distance.[2]

Using standard logarithm inequalities (specifically $-\ln(1 - z) \ge z$), we can weaken the Hellinger-KL inequality to

$$D_{KL}(P\ ||\ Q) \ge H^2(P, Q)$$

So, as $\epsilon$ goes to 0, the Hellinger distance goes to 0, and therefore $\sqrt{P}$ and $\sqrt{Q}$ are arbitrarily close together in standard Euclidean distance. Since $D_{KL}$ is smooth (for strictly positive distributions, which we have assumed), we can therefore use a second order approximation (with respect to $\sqrt{Q} - \sqrt{P}$) for our arbitrarily small $\epsilon$’s.
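For the skeptical reader, here is a small numerical spot-check of the Hellinger-KL inequality and its weakened form (a sketch of ours, not needed for the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(10_000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    kl = np.sum(p * np.log(p / q))
    h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)   # squared Hellinger distance
    assert kl + 1e-12 >= -2 * np.log(1 - h2 / 2)  # Hellinger-KL inequality
    assert -2 * np.log(1 - h2 / 2) + 1e-12 >= h2  # weakened form
print("Hellinger-KL inequality held on all samples")
```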

Expansion

Now for the second order expansion itself.

Our small quantity is $\delta := \sqrt{Q} - \sqrt{P}$ (componentwise). Then

$$D_{KL}(P\ ||\ Q) = -2 \sum_x P[x] \ln\!\left(1 + \frac{\delta_x}{\sqrt{P[x]}}\right) \approx -2 \sum_x \sqrt{P[x]}\,\delta_x + \sum_x \delta_x^2$$

To simplify that further, we can use the sum-to-1 constraints on the distributions: $\sum_x Q[x] = \sum_x P[x] = 1$ implies

$$\sum_x \left(\sqrt{P[x]} + \delta_x\right)^2 = \sum_x P[x]$$

so $\sum_x \sqrt{P[x]}\,\delta_x = -\frac{1}{2} \sum_x \delta_x^2$. That simplifies our second order approximation to

$$D_{KL}(P\ ||\ Q) \approx 2 \sum_x \delta_x^2 = 2\,H^2(P, Q)$$

i.e. $D_{KL}$ in the second order regime is twice the squared Hellinger distance.
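A quick numerical illustration of that second order claim (a sketch of ours, not part of the proof): as the perturbation shrinks, the ratio $D_{KL} / (2 H^2)$ approaches 1.

```python
import numpy as np

rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(5))
for scale in [1e-1, 1e-2, 1e-3]:
    q = p * (1 + rng.normal(size=5) * scale)   # small relative perturbation
    q /= q.sum()                               # renormalize to a distribution
    kl = np.sum(p * np.log(p / q))
    h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
    print(scale, kl / (2 * h2))                # ratio -> 1 as the perturbation shrinks
```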

Combining this with Step 1, we’ve now established that if we can prove our desired bound with (twice) squared Hellinger distances in place of the $D_{KL}$’s, then the bound also applies globally for the $D_{KL}$ errors. So now, we can set aside the notoriously finicky KL divergences, and work with good ol’ Euclidean geometry.

Step 3: Good Ol’ Euclidean Geometry

Writing everything out in the second order regime, our preconditions say

$$\mathbb{E}_{X_1,X_2}\left[H^2\big(P[\Lambda|X_1,X_2],\ P[\Lambda|X_2]\big)\right] \le \frac{\epsilon}{2}$$
$$\mathbb{E}_{X_1,X_2}\left[H^2\big(P[\Lambda|X_1,X_2],\ P[\Lambda|X_1]\big)\right] \le \frac{\epsilon}{2}$$

and we want to bound

$$\mathbb{E}_{X_1,X_2}\left[H^2\big(P[\Lambda'|X_1,X_2],\ P[\Lambda'|X_2]\big)\right] = \mathbb{E}_{X_1,X_2}\left[H^2\big(P[\Lambda|X_1],\ P[\Lambda'|X_2]\big)\right]$$

where $P[\Lambda'|X_2] = \sum_{x_1} P[X_1{=}x_1|X_2]\,P[\Lambda|X_1{=}x_1]$.

That last expression has a Jensen vibe to it, so let’s use Jensen’s inequality.

Jensen

We’re going to use Jensen’s inequality on the squared Hellinger distance, so we need to establish that squared Hellinger distance is convex as a function of the pair of distributions $(P, Q)$.

The squared Hellinger distance is a sum over outcomes of terms $\left(\sqrt{P[x]} - \sqrt{Q[x]}\right)^2$, so it suffices to check convexity of each term. Differentiating twice with respect to $(P[x], Q[x])$ yields the Hessian

$$\frac{1}{2}\begin{pmatrix} \sqrt{Q[x]}\,P[x]^{-3/2} & -\left(P[x]\,Q[x]\right)^{-1/2} \\ -\left(P[x]\,Q[x]\right)^{-1/2} & \sqrt{P[x]}\,Q[x]^{-3/2} \end{pmatrix}$$

Note that one column is the other column multiplied by $-P[x]/Q[x]$, so one of the eigenvalues is 0. The trace is positive, so the other eigenvalue is positive. Thus, the function is (non-strictly) convex.
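Here is a small numerical spot-check of that convexity claim (a sketch of ours, not part of the proof): at random points, the per-outcome Hessian has one (numerically) zero eigenvalue and one positive eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(4)
for _ in range(1000):
    p, q = rng.uniform(0.01, 1.0, size=2)
    # Hessian of f(p, q) = (sqrt(p) - sqrt(q))^2 with respect to (p, q)
    hess = 0.5 * np.array([[np.sqrt(q) * p**-1.5, -(p * q)**-0.5],
                           [-(p * q)**-0.5,       np.sqrt(p) * q**-1.5]])
    eigs = np.linalg.eigvalsh(hess)             # ascending order
    assert abs(eigs[0]) < 1e-9 * eigs[1] and eigs[1] > 0
print("Hessian is positive semidefinite with a zero eigenvalue at every point checked")
```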

Now, we’ll use Jensen’s inequality on

$$H^2\big(P[\Lambda|X_2],\ P[\Lambda'|X_2]\big) = H^2\Big(\textstyle\sum_{x_1} P[X_1{=}x_1|X_2]\,P[\Lambda|X_1{=}x_1,X_2],\ \ \sum_{x_1} P[X_1{=}x_1|X_2]\,P[\Lambda|X_1{=}x_1]\Big)$$

Specifically:

  • We’ll hold $X_2$ constant, so it’s not involved in our application of Jensen.

  • For each $X_2$ value, the expression takes a convex combination, with weights $P[X_1|X_2]$, of the inputs to the function $H^2$, i.e. of the pairs $\big(P[\Lambda|X_1,X_2],\ P[\Lambda|X_1]\big)$.

  • Viewing the pair of distributions $\big(P[\Lambda|X_1,X_2],\ P[\Lambda|X_1]\big)$ at fixed $X_2$ as the function’s inputs, the function is convex.

So, applying Jensen’s, we get

$$H^2\big(P[\Lambda|X_2],\ P[\Lambda'|X_2]\big) \le \sum_{x_1} P[X_1{=}x_1|X_2]\ H^2\big(P[\Lambda|X_1{=}x_1,X_2],\ P[\Lambda|X_1{=}x_1]\big)$$
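A quick numerical spot-check of that Jensen step (a sketch of ours, not part of the proof; the 4x4x4 outcome spaces are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
h2 = lambda p, q: np.sum((np.sqrt(p) - np.sqrt(q)) ** 2, axis=-1)

P = rng.dirichlet(np.ones(64)).reshape(4, 4, 4)                     # P[X1, X2, Lambda]
P_x1_given_x2 = P.sum(axis=2) / P.sum(axis=(0, 2))                  # P[X1 | X2], shape (X1, X2)
P_lam_given_x1x2 = P / P.sum(axis=2, keepdims=True)                 # P[Lambda | X1, X2]
P_lam_given_x1 = P.sum(axis=1) / P.sum(axis=(1, 2))[:, None]        # P[Lambda | X1]

for x2 in range(4):
    w = P_x1_given_x2[:, x2]                                        # Jensen weights over X1
    mix_a = w @ P_lam_given_x1x2[:, x2, :]                          # = P[Lambda  | X2=x2]
    mix_b = w @ P_lam_given_x1                                      # = P[Lambda' | X2=x2]
    lhs = h2(mix_a, mix_b)
    rhs = np.sum(w * h2(P_lam_given_x1x2[:, x2, :], P_lam_given_x1))
    assert lhs <= rhs + 1e-12
print("Jensen bound held at every X2 value")
```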

Euclidean Distances

With that, we have bounds on three (squared) Hellinger distances:

$$\mathbb{E}_{X_1,X_2}\left[H^2\big(P[\Lambda|X_1],\ P[\Lambda|X_1,X_2]\big)\right] \le \frac{\epsilon}{2}$$
$$\mathbb{E}_{X_1,X_2}\left[H^2\big(P[\Lambda|X_1,X_2],\ P[\Lambda|X_2]\big)\right] \le \frac{\epsilon}{2}$$
$$\mathbb{E}_{X_2}\left[H^2\big(P[\Lambda|X_2],\ P[\Lambda'|X_2]\big)\right] \le \mathbb{E}_{X_1,X_2}\left[H^2\big(P[\Lambda|X_1,X_2],\ P[\Lambda|X_1]\big)\right] \le \frac{\epsilon}{2}$$

(the first two from the preconditions, the third by taking the expectation of the Jensen bound over $X_2$).

So, on average over $X_1, X_2$:

  • The Euclidean (Hellinger) distance between $P[\Lambda|X_1]$ and $P[\Lambda|X_1,X_2]$ is at most $\sqrt{\epsilon/2}$.

  • The Euclidean (Hellinger) distance between $P[\Lambda|X_1,X_2]$ and $P[\Lambda|X_2]$ is at most $\sqrt{\epsilon/2}$.

  • The Euclidean (Hellinger) distance between $P[\Lambda|X_2]$ and $P[\Lambda'|X_2]$ is at most $\sqrt{\epsilon/2}$.

So, on average, the Euclidean distance end-to-end, between $P[\Lambda'|X_1,X_2] = P[\Lambda|X_1]$ and $P[\Lambda'|X_2]$, is at most $3\sqrt{\epsilon/2}$.

That gives us the desired bound: writing $d_1, d_2, d_3$ for the three distances above and using $(d_1 + d_2 + d_3)^2 \le 3\,(d_1^2 + d_2^2 + d_3^2)$,

$$\mathbb{E}_{X_1,X_2}\left[H^2\big(P[\Lambda'|X_1,X_2],\ P[\Lambda'|X_2]\big)\right] \le \mathbb{E}\left[(d_1 + d_2 + d_3)^2\right] \le 3\left(\frac{\epsilon}{2} + \frac{\epsilon}{2} + \frac{\epsilon}{2}\right) = \frac{9}{2}\epsilon$$

implying

$$\mathbb{E}_{X_1,X_2}\left[D_{KL}\big(P[\Lambda'|X_1,X_2]\ \big|\big|\ P[\Lambda'|X_2]\big)\right] \approx 2\,\mathbb{E}_{X_1,X_2}\left[H^2\big(P[\Lambda'|X_1,X_2],\ P[\Lambda'|X_2]\big)\right] \le 9\epsilon$$

Combined with our previous two sections, that establishes the desired upper bound of $9\epsilon$ on $D_{KL}\big(P[X_1,X_2,\Lambda']\ ||\ P[X_1,X_2]\,P[\Lambda'|X_2]\big)$.
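Note that the geometric part of the argument (triangle inequality, the quadratic-mean step, and Jensen) holds exactly for Hellinger distances, with no second order approximation; only the translation to and from $D_{KL}$ needed the small-$\epsilon$ regime. Here is a small numpy check of that exact Hellinger-level inequality (a sketch of ours, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(7)
h2 = lambda p, q: np.sum((np.sqrt(p) - np.sqrt(q)) ** 2, axis=-1)

for _ in range(200):
    P = rng.dirichlet(np.ones(27)).reshape(3, 3, 3)                 # P[X1, X2, Lambda]
    P_x1x2 = P.sum(axis=2)                                          # P[X1, X2]
    P_l_x1x2 = P / P_x1x2[:, :, None]                               # P[Lambda | X1, X2]
    P_l_x1 = P.sum(axis=1) / P.sum(axis=(1, 2))[:, None]            # P[Lambda | X1]
    P_l_x2 = P.sum(axis=0) / P.sum(axis=(0, 2))[:, None]            # P[Lambda | X2]
    # Resampled conditional: P[Lambda' | X2] = sum_x1 P[X1 | X2] P[Lambda | X1]
    P_x1_given_x2 = P_x1x2 / P_x1x2.sum(axis=0)
    Pr_l_x2 = P_x1_given_x2.T @ P_l_x1                              # shape (X2, Lambda)

    end  = np.sum(P_x1x2 * h2(P_l_x1[:, None, :], Pr_l_x2[None, :, :]))   # end-to-end
    leg1 = np.sum(P_x1x2 * h2(P_l_x1[:, None, :], P_l_x1x2))
    leg2 = np.sum(P_x1x2 * h2(P_l_x1x2, P_l_x2[None, :, :]))
    # The third leg reuses leg1's quantity via the Jensen bound.
    assert end <= 3 * (leg1 + leg2 + leg1) + 1e-9
print("Step 3's geometric bound held on all random trials")
```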

Empirical Results and Room for Improvement

Below is a plot of the maximal error achieved via numerical maximization of the resampled diagram’s error, $D_{KL}\big(P[X_1,X_2,\Lambda']\ ||\ P[X_1,X_2]\,P[\Lambda'|X_2]\big)$, subject to a constraint on $\epsilon_1 + \epsilon_2$ (the sum of the two original diagrams’ errors), searching over distributions of $X$ and $\Lambda$. Above, we proved that the ratio of those two quantities can be no higher than 9. As expected from Step 1 of the proof, the line segment between any point on the curve and the origin always lies on or below the curve, since scaling down the errors traces out exactly such a segment. Some noise is presumably present due to occasional failures of the optimizer to find the maximum error.

Zooming in on the steepest part of the curve and eyeballing the plot, it looks like the maximum ratio achieved is around 4 (0.02/0.005), implying an empirical upper bound for the resampled diagram of roughly $8\epsilon$:

Looking into the actual solutions found, the solutions with a ratio of ~4 involve one of the two terms in the x-axis sum being much larger than the other (5-10x). We therefore expect that it is possible, in principle, to get a tighter bound (~4, empirically, rather than the proven 9 or the empirical 8). The most likely place for improvement in the proof is to bound one of the Hellinger distances in the chain directly, cutting one step out of the “path”; that would indeed reduce the bound from $9\epsilon$ to $4\epsilon$. We’ll leave that for future work.
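For reference, here is a minimal sketch of the kind of numerical search described above (the choice of scipy’s SLSQP, the 2x2x2 outcome spaces, the random restarts, and the cap of 0.005 on $\epsilon_1 + \epsilon_2$ are all our assumptions, not details of the original experiments):

```python
import numpy as np
from scipy.optimize import minimize

def unpack(z):
    """Map unconstrained parameters to a strictly positive joint P[X1, X2, Lambda]."""
    w = np.exp(z)
    return (w / w.sum()).reshape(2, 2, 2)

def errors(P):
    kl = lambda p, q: np.sum(p * np.log(p / q))
    P_x1x2 = P.sum(axis=2, keepdims=True)
    P_lam_x1 = P.sum(axis=1, keepdims=True) / P.sum(axis=(1, 2), keepdims=True)
    P_lam_x2 = P.sum(axis=0, keepdims=True) / P.sum(axis=(0, 2), keepdims=True)
    eps1 = kl(P, P_x1x2 * P_lam_x2)
    eps2 = kl(P, P_x1x2 * P_lam_x1)
    P_res = P_x1x2 * P_lam_x1                         # resampled joint
    Pr_lam_x2 = P_res.sum(axis=0, keepdims=True) / P_res.sum(axis=(0, 2), keepdims=True)
    return eps1, eps2, kl(P_res, P_x1x2 * Pr_lam_x2)

cap = 0.005
rng = np.random.default_rng(6)
best = 0.0
for _ in range(20):                                   # restarts; the problem is non-convex
    res = minimize(
        lambda z: -errors(unpack(z))[2],              # maximize the resampled error
        rng.normal(size=8),
        method="SLSQP",
        constraints=[{"type": "ineq",
                      "fun": lambda z: cap - sum(errors(unpack(z))[:2])}],
    )
    best = max(best, -res.fun)
print(f"max resampled error found: {best:.4f}  (ratio to cap: {best / cap:.2f})")
```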

Interesting additional piece for future reference: if we include the mediation condition’s error in the denominator, and so look for a bound in terms of a factor times the sum of all the natural latent conditions’ epsilons, we find that the empirical factor in question is roughly 1 (not sure what happened at ~0.4):

  1. ^

    Note that, since $P[X_1, X_2, \Lambda] > 0$ by assumption, none of the $\epsilon$’s are infinite. This is the only place where we need $P[X_1, X_2, \Lambda] > 0$; that assumption can probably be eliminated by considering the infinite case directly, but we’re not going to do that here.

  2. ^

    A quick aside: while it might look messy at first, the Hellinger distance is a particularly natural way to talk about Euclidean distances between probability distributions. In general, if one wants to view a distribution $P$ as a vector, $\sqrt{P}$ (componentwise) is the most natural vector to consider, since the sum-to-1 constraint says $\sqrt{P}$ is a unit vector under the standard Euclidean norm.