The graph showing Kurtosis vs convolutions for the 5 distributions could be interpreted as showing that distributions with higher initial kurtosis take longer to tend towards normal. Can you elaborate why initial skew is a better indicator than initial kurtosis?
The skew vs kurtosis graph suggests that there’s possibly a sweet spot for skew of about 0.25 which enables faster approach to normality than 0. I guess this isn’t real but it adds to my confusion above.
Yes, exactly right: initial kurtosis is a fine indicator of how many convolutions it will take to reach kurtosis = 3. Actually, it’s probably a better indicator than skew, if you already have the kurtosis on hand. Two reasons I chose to look at it in terms of skew:
- the main reason: it’s easier to eyeball skew. I can look at a graph and think “damn, that’s skewed!”, but I’m less able to look and say “boy, is that kurtose!”. I’m just not as familiar with kurtosis geometrically, so maybe others who are more familiar wouldn’t have this problem. It’s also easier for me to reason about skew; I know that income and spend distributions are often skewed, but there aren’t any common real-world problems I find myself thinking of as more or less kurtose.
- I suspect (though I’m not sure) that distance-from-kurtosis-3 is a monotonically decreasing function of the number of convolutions. In that case, saying “things that start closer to three stay closer to three after applying a monotonically decreasing function” felt, I guess, a little bit obvious?
Re: the beta(20, 10) making it look like there’s a sweet spot around skew = 0.25: correct, that isn’t real. beta(20, 10) is already very nearly Gaussian (its excess kurtosis is close to zero) even before any convolutions.
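To make that concrete, here’s a quick check with scipy (a sketch; note that scipy reports Fisher’s excess kurtosis, i.e. kurtosis minus 3):

```python
from scipy.stats import beta

# Skew and excess kurtosis of beta(20, 10), from scipy's closed-form moments.
skew, ex_kurt = beta(20, 10).stats(moments="sk")
print(f"skew            = {float(skew):.3f}")     # ~ -0.25: magnitude matches the apparent "sweet spot"
print(f"excess kurtosis = {float(ex_kurt):.3f}")  # ~ -0.09, i.e. kurtosis is already ~ 2.91
```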
So my understanding then would be that initial skew tells you how fast you will approach the skew of a Gaussian (i.e. 0), and initial kurtosis tells you how fast you will approach the kurtosis of a Gaussian (i.e. 3)?
Using my calibrated eyeball it looks like each time you convolve a function with itself the kurtosis moves half of the distance to 3. If this is true (or close to true) and if there is a similar rule for skew then that would seem super useful.
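As a rough sanity check of that halving rule, here’s a minimal simulation sketch (the exponential starting point is an arbitrary choice with known excess kurtosis 6; summing 2^m iid draws corresponds to m self-convolutions):

```python
import numpy as np
from scipy.stats import kurtosis

# Check the "each self-convolution halves the distance to kurtosis 3" rule.
# Start from an exponential distribution, whose excess kurtosis is 6;
# summing 2**m iid draws corresponds to m self-convolutions.
rng = np.random.default_rng(0)
n = 10**6
for m in range(4):
    draws = rng.exponential(size=(n, 2**m)).sum(axis=1)
    # scipy's kurtosis() defaults to Fisher's definition: 0 for a Gaussian.
    print(f"{m} self-convolutions: excess kurtosis ~ {kurtosis(draws):.2f}")
# Expect roughly 6, 3, 1.5, 0.75: halving each time.
```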
I do have some experience with distributions where kurtosis is very important. In one example I was initially modelling with a normal distribution, but found, as more data became available, that it was better to replace it with a logistic distribution, which has thicker tails. This can be very important when analysing safety-critical components, where the tail of the distribution is key.
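For a sense of how much those tails can differ, here’s a small illustrative sketch (the unit-variance matching below is an assumption, just to put the two on a comparable scale) comparing upper-tail probabilities of a standard normal and a variance-matched logistic:

```python
import math
from scipy.stats import norm, logistic

# Compare upper-tail probabilities of a standard normal and a logistic
# with matched mean (0) and variance (1). A logistic with scale s has
# variance (pi*s)**2 / 3, so s = sqrt(3)/pi gives unit variance.
s = math.sqrt(3) / math.pi
for x in (2, 3, 4, 5):
    print(f"x = {x}: P(normal > x) = {norm.sf(x):.2e}, "
          f"P(logistic > x) = {logistic.sf(x, scale=s):.2e}")
# Far out in the tail the logistic's exceedance probability is orders of
# magnitude larger, which is what matters for safety-critical analysis.
```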
If you have two independent things with kurtoses $k_1, k_2$ and corresponding variances $v_1, v_2$, then their sum (i.e., the convolution of the probability distributions) has kurtosis
$$\left(\frac{v_1}{v_1+v_2}\right)^2 k_1 + \left(\frac{v_2}{v_1+v_2}\right)^2 k_2 + \frac{6 v_1 v_2}{(v_1+v_2)^2}$$
(in general there are two more cross-terms involving “cokurtosis” values that equal 0 in this case, and the last term involves another cokurtosis that equals 1 in this case).
We can rewrite this as
$$\left(\frac{v_1}{v_1+v_2}\right)^2 (k_1-3) + \left(\frac{v_2}{v_1+v_2}\right)^2 (k_2-3) + 3\left(\left(\frac{v_1}{v_1+v_2}\right)^2 + \frac{2 v_1 v_2}{(v_1+v_2)^2} + \left(\frac{v_2}{v_1+v_2}\right)^2\right),$$
which equals
$$\left(\frac{v_1}{v_1+v_2}\right)^2 (k_1-3) + \left(\frac{v_2}{v_1+v_2}\right)^2 (k_2-3) + 3.$$
So if both kurtoses differ from 3 by at most $\delta$, then the new kurtosis differs from 3 by at most $\frac{v_1^2+v_2^2}{(v_1+v_2)^2}\delta$, which is at most $\delta$, and strictly less provided both variances are nonzero. If $v_1 = v_2$ then indeed the factor is exactly $1/2$.
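Here’s a quick numerical check of that formula (a sketch; the exponential and uniform below are arbitrary choices with known variances and kurtoses):

```python
import numpy as np
from scipy.stats import kurtosis

# Numerical check of the kurtosis-of-a-sum formula for two independent
# variables with different variances: an exponential (variance 1, plain
# kurtosis 9) and a uniform on [0, 6] (variance 3, plain kurtosis 1.8).
rng = np.random.default_rng(1)
n = 10**6
x = rng.exponential(scale=1.0, size=n)
y = rng.uniform(0.0, 6.0, size=n)

v1, v2, k1, k2 = 1.0, 3.0, 9.0, 1.8
w1, w2 = v1 / (v1 + v2), v2 / (v1 + v2)
predicted = w1**2 * k1 + w2**2 * k2 + 6 * v1 * v2 / (v1 + v2) ** 2
observed = kurtosis(x + y, fisher=False)  # fisher=False gives plain kurtosis

print(f"predicted kurtosis = {predicted:.3f}")  # 2.700 exactly, by the formula
print(f"observed  kurtosis = {observed:.3f}")   # ~ 2.70 up to sampling noise
```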
So Maxwell’s suspicions and Bucky’s calibrated eyeball are both correct.
Wow! Cool—thanks!
Those possible approximate rules are interesting. I’m not sure about the answers to any of those questions.