This was a cool post! I was familiar with f-divergences as a generalization of KL divergence, and of course familiar with maxent methods, but I hadn’t seen the two put together before.
Problem: I’m left unsure when or why I would ever want to use this machinery. Sufficient statistics are an intuitive concept, and I can look around at the world and make guesses about where I would want to use sufficient statistics to model things. Independence is also an intuitive concept, and I can look around the world and make guesses about where to use that. Combining those two, I can notice places in the world where (some version of) KPD should bind, and if it doesn’t bind then I’m surprised.
But I don’t see how to notice places where max-f-divergence distributions should bind to reality. I can see where sufficient statistics should exist in the world, but that’s not a sufficient condition for a max-f-divergence distribution; there are (IIUC) lots of other distributions for which sufficient statistics exist. So why focus on the max-f-divergence class specifically? What intuitive property of (some) real-world systems nails down that class of distributions? Maybe some kind of convexity condition on the distribution or on updates or something?
A practical roadblock is that the above numerical results for inference are terribly slow to compute...
Not sure exactly what you’re doing numerically, but here’s how I usually handle vanilla maxent problems numerically (in my usual notation, which is not quite the same as yours, apologies for putting marginal translation work on you). We start with
maxentP[X] subject to E[f(X)]=μ
Transform that to
maxentP[X] subject to E[f(X)−μ]=0
This gives a dual problem for the partition function, which I’ll write in full:
minλln(∑xeλT(f(x)−μ))
That’s an unconstrained optimization problem, it’s convex in the happy direction, and standard optimizers (e.g. scipy using jax for derivatives) can usually handle it very quickly.
I would guess that process generalizes just fine to other f-divergences, since it’s basically just relying on convex optimization tricks. And that should yield quite fast numerics.
Your notation is clear to me! It can be shown that: Df(p⋆,q)=∫dxq(x)f⋄(f⋄#(⟨λ,T(x)⟩))
Even though f⋄ and f⋄# are both convex, their composition is not necessarily convex. Sorry, I don’t have any clear counterexamples at the ready. (The Hessian determinants all vanish in the graphene/bit-string model.) It’s likely that I’ve configured my numerical solvers badly for this example.
As far as binding to reality:
Power laws are ubiquitous in nature. Exponential family doesn’t admit power laws; f-div does. (It admits many other important non-exponential distributions as well—see the B-E example above.)
Symmetries in your data that are weaker than—or just different from—independence are ubiquitous in nature. Exponential family is only congruent with independence; not so, f-div.
KL / Gibbs entropy is inadequate for many observable systems. This typically appears under the heading nonextensive entropy (Gell-Mann & Tsallis w/ contributions by Touchette and others) or Tsallis entropy. This paper is far enough outside my field that I can only loosely validate it, but it’s an attempt to explain dark energy by way of applying observational data to determine Bayes factors of different nonextensive entropy models of our cosmological horizon.
I consider Jensen’s, DPI, and coarse-graining monotonicity to be pretty important desiderata (see Nielsen). f-div is the biggest family that supports these, uniquely determined.
[I could write several more posts worth of material]
This was a cool post! I was familiar with f-divergences as a generalization of KL divergence, and of course familiar with maxent methods, but I hadn’t seen the two put together before.
Problem: I’m left unsure when or why I would ever want to use this machinery. Sufficient statistics are an intuitive concept, and I can look around at the world and make guesses about where I would want to use sufficient statistics to model things. Independence is also an intuitive concept, and I can look around the world and make guesses about where to use that. Combining those two, I can notice places in the world where (some version of) KPD should bind, and if it doesn’t bind then I’m surprised.
But I don’t see how to notice places where max-f-divergence distributions should bind to reality. I can see where sufficient statistics should exist in the world, but that’s not a sufficient condition for a max-f-divergence distribution; there are (IIUC) lots of other distributions for which sufficient statistics exist. So why focus on the max-f-divergence class specifically? What intuitive property of (some) real-world systems nails down that class of distributions? Maybe some kind of convexity condition on the distribution or on updates or something?
Not sure exactly what you’re doing numerically, but here’s how I usually handle vanilla maxent problems numerically (in my usual notation, which is not quite the same as yours, apologies for putting marginal translation work on you). We start with
maxentP[X] subject to E[f(X)]=μ
Transform that to
maxentP[X] subject to E[f(X)−μ]=0
This gives a dual problem for the partition function, which I’ll write in full:
minλ ln(∑xeλT(f(x)−μ))
That’s an unconstrained optimization problem, it’s convex in the happy direction, and standard optimizers (e.g. scipy using jax for derivatives) can usually handle it very quickly.
I would guess that process generalizes just fine to other f-divergences, since it’s basically just relying on convex optimization tricks. And that should yield quite fast numerics.
Your notation is clear to me! It can be shown that:
Df(p⋆,q)=∫dxq(x)f⋄(f⋄#(⟨λ,T(x)⟩))
Even though f⋄ and f⋄# are both convex, their composition is not necessarily convex. Sorry, I don’t have any clear counterexamples at the ready. (The Hessian determinants all vanish in the graphene/bit-string model.) It’s likely that I’ve configured my numerical solvers badly for this example.
As far as binding to reality:
Power laws are ubiquitous in nature. Exponential family doesn’t admit power laws; f-div does. (It admits many other important non-exponential distributions as well—see the B-E example above.)
Symmetries in your data that are weaker than—or just different from—independence are ubiquitous in nature. Exponential family is only congruent with independence; not so, f-div.
KL / Gibbs entropy is inadequate for many observable systems. This typically appears under the heading nonextensive entropy (Gell-Mann & Tsallis w/ contributions by Touchette and others) or Tsallis entropy. This paper is far enough outside my field that I can only loosely validate it, but it’s an attempt to explain dark energy by way of applying observational data to determine Bayes factors of different nonextensive entropy models of our cosmological horizon.
I consider Jensen’s, DPI, and coarse-graining monotonicity to be pretty important desiderata (see Nielsen). f-div is the biggest family that supports these, uniquely determined.
[I could write several more posts worth of material]