Note that parts of my post are actually model-free! For example, the mathematical definition and the example of twin studies do not make use of a model.
Yes, good point, I should have said “unlike regression” rather than “unlike variance explained”. I’ll have to think more on how the type of analysis described in the twin example maps onto information theory.
But this is predicated on the implicit model that Y is a normally distributed variable.
I’m not aware of (implicitly) making that assumption in my post!
By “this” I meant the immediately preceding statements in my post. (Although the cartoon distributions you show do look normal-ish, so at least you invite that intuition). The idea that the mean or average is a good measure of central tendency of a distribution, or a good estimator, is so familiar we forget that it requires justification. For Normal distributions, it is the lowest MSE estimator, the maximum likelihood estimator, and is an unbiased estimator, but this isn’t true of all distributions. For a skewed, long-tailed distribution, for example, the median is a better estimator. For a binomial distribution, the mean is almost never the maximum likelihood estimator. For a Cauchy distribution, the mean is not even defined (although to be fair I’m not entirely sure entropy is well defined in that case, either). Likewise the properties of variance that make it a good estimator of dispersion for a Normal distribution don’t necessarily make it good for other distributions.
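As a quick numerical illustration of the Cauchy point, here is a minimal sketch (the sample sizes are arbitrary choices of mine): the running sample median settles down near the true median, but the running sample mean never does.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_cauchy(100_000)  # heavy-tailed draws; this distribution has no defined mean

# Compare the two estimators as more and more samples come in
for n in (100, 1_000, 10_000, 100_000):
    print(f"n={n:>7}  sample mean={np.mean(x[:n]):>9.3f}  sample median={np.median(x[:n]):>7.3f}")

# The sample median converges toward the true median (0), while the sample mean
# keeps getting yanked around by occasional extreme draws.
```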
It is true that partitioning of variance and “variance explained” as such don’t rely on a normality assumption, and there are non-parametric versions of regression, correlation, ANOVA etc. that don’t assume normality. So I have not entirely put my finger on what the difference is.
You can measure mutual information even if the form of the relationship is unknown or complicated.
Is this so? Suppose we want to measure differential entropy, as a simplified example, and the true density “oscillates” a lot. In that case, I’d expect the entropy to be different from what it would be if the density were smoother. But it might be hard to see the difference in a small dataset. The type of regularity/simplicity assumptions made about the density might thus influence the result.
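Here is a rough sketch of the kind of thing I have in mind (the “striped” density, bin counts, and sample sizes are all arbitrary choices): a density that oscillates between 0 and 2 on fine stripes has differential entropy of −1 bit, versus 0 bits for the plain uniform density, yet a coarse or under-sampled histogram estimate cannot tell the two apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_smooth(n):
    # Uniform on [0, 1]: true differential entropy = 0 bits
    return rng.uniform(0.0, 1.0, n)

def sample_oscillating(n, m=50):
    # Density equal to 2 on m thin stripes covering half of [0, 1], 0 elsewhere:
    # true differential entropy = -1 bit, but it looks uniform at coarse resolution
    k = rng.integers(0, m, n)
    return k / m + rng.uniform(0.0, 1.0 / (2 * m), n)

def plugin_entropy_bits(x, bins):
    # Histogram plug-in estimate of differential entropy, in bits
    counts, edges = np.histogram(x, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum() + np.log2(edges[1] - edges[0])

# Only the large-sample, fine-bin setting separates the two densities;
# small samples with fine bins give badly biased estimates for both.
for n in (200, 200_000):
    for bins in (10, 1_000):
        print(f"n={n:>7} bins={bins:>5}  "
              f"smooth: {plugin_entropy_bits(sample_smooth(n), bins):6.2f} bits  "
              f"oscillating: {plugin_entropy_bits(sample_oscillating(n), bins):6.2f} bits")
```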
This might be a good place to mention that I work exclusively with discrete entropy, and am not very familiar with notations or proofs in differential (continuous) entropy. So if Y is continuous, in practice this involves discretizing the value of Y (binning your histograms). I agree the continuous case would be more directly comparable, but I don’t think this is likely to be fundamentally important, do you?
In principle, conceptually, you can estimate entropy directly from the probability density function (PDF) non-parametrically as H = sum(-P log2 P), where the sum is over all possible values of Y, and P is the probability Y takes on a given value.[1] Likewise, you can estimate the mutual information directly from the joint probability distribution between X and Y, the equation for which I won’t try to write out here without an equation editor. In practice, if Y is continuous, the more data you have, the more finely you can discretize Y and the more subtly you can describe the shape of the distribution, so you converge on the true PDF and thus the true entropy as the data size goes to infinity.
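To make the plug-in idea concrete, here is a minimal sketch (the example distributions and the binning of Y are arbitrary choices for illustration):

```python
import numpy as np
from collections import Counter

def entropy_bits(labels):
    # Plug-in estimate of H = sum(-P log2 P), with P taken from empirical frequencies
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def mutual_information_bits(a, b):
    # I(A;B) = H(A) + H(B) - H(A,B), all estimated from empirical (joint) frequencies
    return entropy_bits(list(a)) + entropy_bits(list(b)) - entropy_bits(list(zip(a, b)))

rng = np.random.default_rng(0)
x = rng.integers(0, 4, 50_000)                         # discrete X with 4 possible values
y = x + rng.normal(0.0, 1.0, 50_000)                   # continuous Y that depends on X
y_binned = np.digitize(y, np.linspace(-2.0, 5.0, 20))  # discretize Y (an arbitrary binning)

print("H(Y_binned)    =", round(entropy_bits(list(y_binned)), 3), "bits")
print("I(X; Y_binned) =", round(mutual_information_bits(x, y_binned), 3), "bits")
```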
I’m not denying that it can take a lot of data to measure entropy or mutual information by brute force in this way. What is worse, these naive estimators are biased if distributions are under-sampled. So getting a good estimate of entropy or mutual information from data is very tricky, and the shape of the distribution can make the estimation more or less tricky. To the extent one relies on regularity or simplicity assumptions to overcome data limitations, these assumptions can affect your result.
Still, if you are careful about it, an estimate based on assumptions can still be a strict bound: X removes at least z% of your uncertainty about Y. There is a direct analogy in regression models: if Yhat = f(X) explains z% of the variance of Y (assuming this is established properly), then X “Platonically” explains at least z% of the variance of Y.
Relatedly, you can pre-process X into some derived variable such as Q = f(X) or an estimator Yhat = f(X), and then measure the mutual information between the derived variable and the true value of Y. The Data Processing Inequality states that if the derived variable contains a given amount of information about Y, then the input variable X must contain at least that much information. This is very much like defining a particular regression model f(X); and in the Yhat = f(X) case, it does give you a model you can use to predict Y from X.
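As a toy illustration of that pre-processing idea (the derived variable here is just a made-up coarse summary of X; the small helpers repeat the previous sketch so this runs on its own):

```python
import numpy as np
from collections import Counter

def entropy_bits(labels):
    # Plug-in entropy from empirical frequencies, in bits
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def mi_bits(a, b):
    # I(A;B) = H(A) + H(B) - H(A,B)
    return entropy_bits(list(a)) + entropy_bits(list(b)) - entropy_bits(list(zip(a, b)))

rng = np.random.default_rng(1)
x = rng.integers(0, 4, 20_000)                 # input variable X
y = (x + rng.normal(0.0, 1.0, 20_000)) > 1.5   # a binary outcome Y that depends on X
q = x >= 2                                     # derived variable Q = f(X): a coarse summary of X

# Data Processing Inequality: I(X; Y) >= I(Q; Y), so I(Q; Y) is a lower bound
# on how much information X carries about Y.
print("I(X; Y) =", round(mi_bits(x, y), 3), "bits")
print("I(Q; Y) =", round(mi_bits(q, y), 3), "bits")
```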
The idea that the mean or average is a good measure of central tendency of a distribution, or a good estimator, is so familiar we forget that it requires justification. For Normal distributions, it is the lowest MSE estimator, the maximum likelihood estimator, and is an unbiased estimator, but this isn’t true of all distributions. For a skewed, long-tailed distribution, for example, the median is a better estimator.
Is it correct to say that the mean is a good estimator whenever the variance is finite? If so, maybe I should have added that assumption to the post.
I wonder how to think about that in the case of entropy, which you thought about analyzing. Differential entropy can also be infinite, for example. But the Cauchy distribution, which you mention, has infinite variance but finite differential entropy, at least.
1. Sorry, I haven’t figured out the equation editor yet.
You can type Cmd+4 to type inline latex formulas, and Cmd+m to type standalone latex formulas! Hope that helps.
In principle, conceptually, you can estimate entropy directly from the probability density function (PDF) non-parametrically as H = sum(-P log2 P), where the sum is over all possible values of Y, and P is the probability Y takes on a given value. Likewise, you can estimate the mutual information directly from the joint probability distribution between X and Y, the equation for which I won’t try to write out here without an equation editor.
Note: After writing the next paragraph, I noticed that you made essentially the same points further below in your answer, but I’m still keeping my paragraph here for completeness.
I was more wondering whether we can estimate them from data, where we don’t get the ground-truth values for the probabilities that appear in the formulas for entropy and mutual information, at least not directly. If we have lots of data, then we can approximate a PDF, that is true, but I’m not aware of a way of doing so that is entirely principled or works without regularity assumptions. As an example, let’s say we want to estimate the conditional entropy H(Y∣X) (a replacement for the “remaining variance” in my post) for continuous X and Y. I think in this case, if all sampled x-values differ from each other, you could in principle come to the conclusion that there is no uncertainty in Y conditional on X at all, since you observe only one Y-value for each X-value. But that would be severe overfitting, similar to what you’d expect in my section titled “When you have lots of data” for continuous X.
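To illustrate how strongly the answer can depend on those choices, here is a small sketch (bin counts and sample size are arbitrary): X and Y are independent, so conditioning on X should not reduce the uncertainty about Y at all, yet the plug-in estimate of H(Y∣X) collapses toward zero once X is binned so finely that each bin holds roughly one sample.

```python
import numpy as np
from collections import Counter

def entropy_bits(labels):
    # Plug-in entropy from empirical frequencies, in bits
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

rng = np.random.default_rng(0)
n = 2_000
x = rng.uniform(0, 1, n)  # continuous X
y = rng.uniform(0, 1, n)  # continuous Y, independent of X, so X tells us nothing about Y

y_binned = np.digitize(y, np.linspace(0, 1, 9))  # fix a coarse binning of Y (8 bins)
print("H(Y_binned) =", round(entropy_bits(list(y_binned)), 2),
      "bits  (equals the true H(Y_binned | X), since X and Y are independent)")

for x_bins in (8, 64, 512, n):  # progressively finer binning of X
    x_binned = np.digitize(x, np.linspace(0, 1, x_bins + 1))
    # Plug-in conditional entropy: H(Y|X) = H(X, Y) - H(X)
    h_cond = entropy_bits(list(zip(x_binned, y_binned))) - entropy_bits(list(x_binned))
    print(f"x_bins = {x_bins:>5}  estimated H(Y|X) = {h_cond:.2f} bits")
```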
Maybe it would be interesting to analyze the conditional entropy case for non-continuous distributions where variance makes less sense.
I think from my point of view we’re largely in agreement, thanks for your further elaborations!
Is it correct to say that the mean is a good estimator whenever the variance is finite?
Well, yes, in the sense that the law of large numbers applies, i.e.
$$\lim_{n\to\infty}\Pr\{\,|\bar{x}-E[X]|<\varepsilon\,\}=1 \quad \forall\,\varepsilon>0$$
The condition for that to hold is actually weaker. If all the $x_i$ are not only drawn from the same distribution but are also independent, the existence of a finite $E[X]$ is necessary and sufficient for the sample mean to converge in probability to $E[X]$ as $n$ goes to infinity, if I understand the theorem correctly (I can’t prove that yet, though; the proof with a finite variance is easy). If the $x_i$ aren’t independent, the necessary condition is still weaker than finite variance, but it’s cumbersome and impractical, so finite variance is fine I guess.
But that alone isn’t enough to always justify the use of the sample mean as an estimator in practice, is it? As foodforthought says, for a normal distribution it’s simultaneously the lowest-MSE estimator, the maximum likelihood estimator, and an unbiased estimator, but that’s not true for other distributions.
A quick example: suppose we want to determine the parameter $p$ of a Bernoulli random variable, i.e. “a coin”. The prior distribution over $p$ is uniform; we flip the coin $n=10$ times and use the sample success rate, $k/n$, i.e. the mean, i.e. the maximum likelihood estimate. Per simulation, the mean squared error $E[(k/n - p)^2]$ is about 0.0167. However, if we use $\frac{k+1}{n+2}$ instead, the mean squared error drops to 0.0139 (code).
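For what it’s worth, here is a small simulation sketch of the comparison described above (uniform prior over $p$, $n=10$ flips, MSE of $k/n$ versus $\frac{k+1}{n+2}$); the exact numbers will wobble a bit with the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
trials, n = 1_000_000, 10

p = rng.uniform(0, 1, trials)  # true p drawn from the uniform prior
k = rng.binomial(n, p)         # number of successes in n flips of each coin

mse_mle = np.mean((k / n - p) ** 2)                # sample success rate (maximum likelihood)
mse_bayes = np.mean(((k + 1) / (n + 2) - p) ** 2)  # posterior mean under the uniform, i.e. Beta(1,1), prior

print(f"MSE of k/n:         {mse_mle:.4f}")    # roughly 0.0167
print(f"MSE of (k+1)/(n+2): {mse_bayes:.4f}")  # roughly 0.0139
```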
Honestly though, all of this seems like a frequentist cockamamie to me. We can’t escape prior distributions; we may as well stop pretending that they don’t exist. Just calculate a posterior and do whatever you want with it. E.g., how did I come up with the $\frac{k+1}{n+2}$ example? Well, it’s the expected value of the posterior beta distribution for $p$ if the prior is uniform, so it also gives a lower MSE.