Finding the Central Limit Theorem in Bayes’ rule
I feel like there’s something in the basic Bayesian update formula that suggests that, as the amount of input data grows, almost every Bayesian posterior will tend toward a Gaussian. This post is me trying to scrape at that vague intuition and get a handle on it. I’m going to list four observations, then put them together at the end to explain why I think this might be true.
#1: The central limit theorem can be viewed as a statement about convolutions
I covered this here, but a quick summary: a decent form of the Central Limit Theorem goes like this:
If you have a bunch of distributions (say, of them), and you convolve them all together into a distribution , then the larger is, the more will resemble a Gaussian (AKA Normal) distribution.
This is true for most component distributions, even if they’re not originally Gaussian, and it’s pretty robust to the shapes of the input distributions, even if those shapes are different. One example from that link:
(Although the CLT is often taught in the context of taking means of samples drawn from a population, in order to estimate the population mean, that’s not an important part of how I’m using it here, and that view of things is best forgotten while reading this post.)
#2: Convolutions have a relationship to the Fourier transform
The Fourier transform of a convolution of two functions is the pointwise product of their Fourier transforms. The inverse is true too, and more relevant to this post: convolving the Fourier transforms of two distributions is the same as multiplying them in the original domain. To say it shorter: Convolution in one domain is equivalent to multiplication in the other.
(The Fourier transform is a rich topic, and I myself don’t understand most of its behavior. If the same is true for you, don’t worry—all you need to know about it for this post is that it moves numbers from one space into another, it has an inverse, and the stuff in this section and the next.)
#3: The Fourier transform of the Gaussian is Gaussian
A table of some transforms. The left column is the original function , the right is the Fourier transform .
The Fourier transform of most functions looks pretty different than the original function. But in the middle of the table, conspicuous: the Gaussian doesn’t. Some action does happen: the t in the left column is an f in the middle column, and they aren’t equal, so they don’t have the same value—but they do have the same Gaussian form. (The site also has a proof that I didn’t read.)
#4: The Bayesian update for combining a prior distribution with a data distribution is a multiplication
Bayes rule is . is a normalizing factor, so when calculating the result of combining distributions P(H) with P(D|H) in practice, you can pretty much ignore it until the end, then find whatever number c makes the distribution sum to 1. Ignoring it here, then, the meat of the Bayesian update is , where means “is proportional to”.
When you’re combining many distribution via this form of Bayes’ rule, then, you end up with a lot of chained multiplications. If I have three datasets and a prior , then .
Putting it together
Okay. So. From section #2, convolving distributions in the original domain is the same as multiplying them in the Fourier domain. For almost all functions, these are pretty different operations, because the form of almost all functions in the Fourier domain is different than in the original domain. However, from #3, this isn’t true for the Gaussian—it’s special in that its form is the same in both domains. Therefore, for the Gaussian (and only the Gaussian), the chained Bayesian multiplications from #4 that give the posterior are somehow… similar… to the convolutions, in a way that I get stuck on when I try to pin down. Finally, since section #1 says the result of convolving many distributions together tends toward Gaussian the more distributions you convolve, each multiplication in the Bayesian update brings the posterior closer to Gaussian.
In terms of math operations
Take our three-datasets example from above. I’ll use to mean convolution, and write for , for , etc.
From #2, . We know from #1 that if there are many , the left-hand side will be Gaussian. The right-hand side has the same form as the Bayesian update, but on distributions in the Fourier space instead of the original space.
If we inverse-transform both sides, then the left side is . From #3, that’s Gaussian whenever is. Therefore: The left-hand side is Gaussian.
On the right side, is just , the original Bayesian update. Therefore: The right-side is the original Bayesian update.
Since the right and left sides are equal, and the left side is Gaussian, so is the right side. Therefore: The original Bayesian chained update is Gaussian.
Is this right?
I feel safe saying that this is at least directionally correct because I found this link, which says “[The posterior] distribution approaches the Normal distribution as n approaches infinity”, and this one, which says similar. Like so many statements of the CLT, they both say the input distributions must be identically distributed, but after looking into it I feel pretty safe ignoring that.
This is all pretty weird. I like thinking of my mind as working in terms of Bayesian updates, but does that mean the form of my uncertainty about things approaches Gaussian as I learn more? It’s actually weird enough that I worry I made some misstep in the math, so please let me know if you notice anything!
JBlack makes a strong challenge in the comments: sure, maybe distributions don’t need to be identical for them to converge under convolution—but that doesn’t mean that non-independence isn’t an issue. I haven’t looked at convolving dependent distributions at all in this series, and the chained Bayes update involves plenty of dependence. So that might break all of this. I might look into this in a later post.