All Gaussian distributions have kurtosis 3, and no other distributions have kurtosis 3. So to check how close a distribution is to Gaussian, we can just check how far from 3 its kurtosis is.
This is wrong. Kurtosis is just the expectation of the 4th power (Edit: renormalized by the expectations of the first and second powers, i.e. the fourth central moment divided by the squared variance). All sorts of distributions have kurtosis 3, for example the discrete distribution that puts equal weight on each of the six values [-1,0,0,0,0,1].
Otherwise an interesting post.
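A quick check of the counterexample above, as a minimal sketch. It assumes the list is read as a uniform distribution over its six entries (so 0 has probability 4/6), and the helper name `kurtosis` is mine. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

def kurtosis(values):
    """Kurtosis (fourth central moment / variance squared) of the uniform
    distribution over the listed values; repeated values act as weights."""
    n = len(values)
    mean = Fraction(sum(values), n)
    var = sum((Fraction(v) - mean) ** 2 for v in values) / n
    fourth = sum((Fraction(v) - mean) ** 4 for v in values) / n
    return fourth / var ** 2

example = [-1, 0, 0, 0, 0, 1]
print(kurtosis(example))                   # 3
print(kurtosis([5 * v for v in example]))  # 3 (kurtosis ignores rescaling)
```

The distribution is nothing like a Gaussian (it has only three distinct values), yet its kurtosis is exactly 3.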
I’m not sure kurtosis is the right measure of how Gaussian a distribution is in practice. I would have been more interested in the absolute difference between the distributions. That might be more difficult to compute, though.
To that point, skew and excess kurtosis are just two of an infinite number of moments, so on their own they do not characterize the distribution. As someone else here suggested, one can look at the Fourier (or other) transform, but then you are again left with evaluating the difference between two functions or distributions. Knowing that the FT of a Gaussian is a Gaussian in its dual space doesn’t help with “how close” a t-domain distribution F(t) is to a t-domain Gaussian G(t); you’ve just moved the problem into dual space.
We have a tendency to want to reduce an infinite-dimensional question to a one-dimensional answer. How about the L1 norm or the L2 norm of the difference? Well, the L2 norm is preserved under the FT, so nothing is gained by switching domains. Using the L1 norm would require some justification other than “it makes calculation easy”.
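For reference, the “preserved under FT” remark is Plancherel’s theorem. Written out for the difference F − G, and assuming the transform convention stated on the right (the hat notation and the symbol ω are mine, not the commenter’s; the 1/2π factor depends on that convention):

```latex
\int_{-\infty}^{\infty} \lvert F(t) - G(t) \rvert^{2}\, dt
  \;=\; \frac{1}{2\pi} \int_{-\infty}^{\infty}
        \lvert \hat F(\omega) - \hat G(\omega) \rvert^{2}\, d\omega ,
\qquad
\hat F(\omega) \;=\; \int_{-\infty}^{\infty} F(t)\, e^{-i\omega t}\, dt .
```

So an L2 distance measured in the dual space is the same number, up to a constant, as the L2 distance measured in the t-domain.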
So it really boils down to what question you are asking: what difference does the difference (between some function and the Gaussian) make? If being wrong (F(t) != G(t) for some t) leads to a loss of money, then use this as the “loss” function. If it is lives saved or lost, use that loss function on the space of distributions. All such loss functions will look like an integral over the domain of L(F(t), G(t)). In this framework there is no universal answer, but once you’ve decided what your loss function is and what your tolerance is, you can compute how many approximations it takes to get your loss below your tolerance.
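A minimal sketch of the “integral of L(F(t), G(t))” idea, under my own assumptions. F is taken to be a unit-variance Laplace density and G a standard Gaussian density. The two loss functions are hypothetical examples, and the 5x penalty in the asymmetric one is made up for illustration.

```python
import math

def gaussian_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def laplace_pdf(t, b=1.0 / math.sqrt(2.0)):
    # Scale b = 1/sqrt(2) gives variance 2*b**2 = 1, matching the Gaussian.
    return math.exp(-abs(t) / b) / (2.0 * b)

def integrated_loss(f, g, loss, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule approximation of the integral of loss(f(t), g(t)) dt."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * h
        total += loss(f(t), g(t))
    return total * h

def squared(f_t, g_t):
    return (f_t - g_t) ** 2

def asymmetric(f_t, g_t):
    # Hypothetical: the Gaussian underestimating F costs 5x more
    # than the Gaussian overestimating it.
    gap = f_t - g_t
    return 5.0 * gap if gap > 0 else -gap

for name, loss in (("squared", squared), ("asymmetric", asymmetric)):
    print(name, integrated_loss(laplace_pdf, gaussian_pdf, loss))
```

The two losses give different numbers for the same pair of distributions, which is the point: the tolerance only means something once the loss has been chosen.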
Another way of looking at it is to understand what we are trying to compare the closeness of the test distribution to. It is not enough to say F(t) is this close to the Gaussian unless you can also tell me what it is not. (This is the “define a cat” problem for elementary school kids.) Is it not close to a Laplace distribution? How far away from Laplace is your test distribution compared to how far away it is from Gaussian? For these kinds of questions, where you want to distinguish between two (or more) possible candidate distributions, the likelihood ratio is a useful metric.
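A sketch of the likelihood-ratio comparison for the Gaussian vs. Laplace case, assuming each family is fitted to the data by maximum likelihood first. The function names and the synthetic samples are mine.

```python
import math
import random
import statistics

def gaussian_loglik(xs):
    """Maximised Gaussian log-likelihood (MLE mean and variance)."""
    n = len(xs)
    mu = statistics.fmean(xs)
    var = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def laplace_loglik(xs):
    """Maximised Laplace log-likelihood (MLE location = median,
    MLE scale = mean absolute deviation from the median)."""
    n = len(xs)
    m = statistics.median(xs)
    b = sum(abs(x - m) for x in xs) / n
    return -n * (math.log(2.0 * b) + 1.0)

def log_likelihood_ratio(xs):
    """Positive values favour Laplace, negative values favour Gaussian."""
    return laplace_loglik(xs) - gaussian_loglik(xs)

rng = random.Random(0)
gaussian_sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# The difference of two independent Exponential(1) draws is Laplace(0, 1).
laplace_sample = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(2000)]

print(log_likelihood_ratio(gaussian_sample))
print(log_likelihood_ratio(laplace_sample))
```

The sign of the ratio says which candidate explains the sample better, and its size says by how much. That answers “closer to Gaussian than to what?” directly.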
Most data scientists and machine learning smiths I’ve worked with assume that in “big data” everything is going to be a normal distribution “because Central Limit Theorem”. But they don’t stop to check that their final distribution is actually Gaussian (they just calculate the mean and the variance and make all sorts of parametric assumptions and p-value type interpretations based on some z-score), much less whether the process that is supposed to give rise to the final distribution is one of sampling repeatedly from different distributions or can be genuinely modeled as convolutions.
One example: the distribution of coefficients in a Logistic model is assumed (by all I’ve spoken to) to be Gaussian (“It is peaked in the middle and tails off to the ends.”). Analysis shows it to be closer to Laplace, and one can model the regression process itself as a diffusion equation in one dimension, whose solution is … Laplace!
I can provide an additional example, this time of a sampling process, where one is sampling from hundreds of distributions of different sizes (or weights), most of which are close to Gaussian. The distribution of the sum is, once again, Laplace! With the right assumptions, one can show mathematically how you get Laplace from Gaussians.
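One standard construction that produces a Laplace distribution from Gaussians, offered only as an illustration: it may not be the set of assumptions the comment has in mind. If each draw comes from a zero-mean Gaussian whose variance is itself drawn as 2W with W ~ Exponential(1), the pooled draws follow Laplace(0, 1). This is a mixture over Gaussians of different widths, and the function name is mine.

```python
import math
import random

def laplace_from_gaussians(rng):
    """One draw from a Gaussian whose variance is random (assumed model:
    variance = 2 * W with W ~ Exponential(1))."""
    variance = 2.0 * rng.expovariate(1.0)
    return rng.gauss(0.0, math.sqrt(variance))

rng = random.Random(0)
xs = [laplace_from_gaussians(rng) for _ in range(100_000)]

# Moments taken about 0, since the construction is symmetric around 0.
n = len(xs)
mean_abs = sum(abs(x) for x in xs) / n
var = sum(x * x for x in xs) / n
kurt = (sum(x ** 4 for x in xs) / n) / var ** 2

# Laplace(0, 1) has E|X| = 1, variance 2 and kurtosis 6;
# a Gaussian with variance 2 would have E|X| of about 1.13 and kurtosis 3.
print(mean_abs, var, kurt)
```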
Thank you, that provided a lot of additional details.
I was interested in visual closeness and I think sum of abs delta would be a good fit. That doesn’t invalidate any of your points.
Actually, I’m very interested in these conditions. Can you elaborate?
The Berry-Essen theorem uses Kolmogorov-Smirnov distance to measure similarity to Gaussian—what’s the maximum difference between the CDF of the two distributions across all values of x?
As this measure is based on absolute difference rather than fractional difference, it doesn’t really care about the tails, and so skew is the main thing stopping this measure from approaching Gaussian. In this case the theorem says the error shrinks in proportion to 1/√n.
From other comments it seems skew isn’t the best measure for getting kurtosis similar to a Gaussian; rather, kurtosis (and variance) of the initial function(s) is a better predictor, and skew only affects it inasmuch as skew and kurtosis/variance are correlated.
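A small numerical illustration of the root-n behaviour in the Berry-Esseen setting, under my own choice of example: sums of n independent Exponential(1) variables, which are skewed and have an exact CDF (the Erlang distribution). The KS distance is approximated on a grid of standardized values, and the function names are mine.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def erlang_cdf(x, n):
    """CDF of a sum of n independent Exponential(1) variables."""
    if x <= 0.0:
        return 0.0
    term = math.exp(-x)          # k = 0 term of the Poisson-style sum
    total = term
    for k in range(1, n):
        term *= x / k            # e^{-x} x^k / k!
        total += term
    return 1.0 - total

def ks_to_gaussian(n, grid=4001, span=8.0):
    """Largest CDF gap between the standardized sum and N(0, 1),
    searched over a grid of z values in [-span, span]."""
    mean, sd = float(n), math.sqrt(n)
    worst = 0.0
    for i in range(grid):
        z = -span + 2.0 * span * i / (grid - 1)
        gap = abs(erlang_cdf(mean + sd * z, n) - normal_cdf(z))
        worst = max(worst, gap)
    return worst

for n in (4, 16, 64):
    print(n, ks_to_gaussian(n), ks_to_gaussian(n) * math.sqrt(n))
```

If the error really shrinks like 1/√n, the last column should stay roughly constant as n is quadrupled.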
Great theorem! Altho note that it’s “Esseen” not “Essen”.
Ha, I don’t know how many times I have read that in the last couple of days and completely failed to notice!
I think this is a very useful measure for practical applications.
I thought about that but didn’t try it—maybe the sum of the absolute difference would work well. I’d tried KS distance, and also taking sum(sum(P(x > y) over y) over x), and wasn’t happy with either.
Why not the Total Variation norm? KS distance is also a good candidate.
I think I didn’t like the supremum part of the KS distance (which it looks like Total Variation has too); it felt like using just the supremum was using too little information. But it might have worked out anyway.
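On the supremum worry: for distributions with densities, the total variation distance equals half the integral of the absolute difference between the densities. So TV and “sum of the absolute difference” are the same measure up to a factor of 2, and TV does use the whole curve. A rough sketch comparing it with the KS distance, where the unit-variance Laplace vs. standard Gaussian pair is my choice of example:

```python
import math

B = 1.0 / math.sqrt(2.0)  # Laplace scale giving variance 2 * B**2 = 1

def gaussian_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gaussian_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def laplace_pdf(x):
    return math.exp(-abs(x) / B) / (2.0 * B)

def laplace_cdf(x):
    return 0.5 * math.exp(x / B) if x < 0 else 1.0 - 0.5 * math.exp(-x / B)

def ks_and_tv(lo=-12.0, hi=12.0, steps=48000):
    """Grid approximations of the KS distance (largest CDF gap) and the
    total variation distance (half the integrated absolute density gap)."""
    h = (hi - lo) / steps
    l1 = 0.0
    ks = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        l1 += abs(gaussian_pdf(x) - laplace_pdf(x)) * h
        ks = max(ks, abs(gaussian_cdf(x) - laplace_cdf(x)))
    return ks, 0.5 * l1

ks, tv = ks_and_tv()
print("KS:", ks, "TV:", tv)
```

KS can never exceed TV, because KS only looks at events of the form “x ≤ c” while TV takes the worst case over all events.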
Fixed—thanks! (Although your example doesn’t sum to 1, so is not an example of a distribution, I think?)
If you want mean 0 and variance 1, scale the example to [−√3, 0, 0, 0, 0, √3].