In my opinion, this post misses the main challenge with the “platonic ideal” vision of a concept like % of explained variance: it’s something that accidentally depends on the distribution of your variables in the sample, rather than a fundamental property of the relationship between those variables.
The total variance Var(Y) depends on the distribution not only of X but of every variable affecting Y. By the law of total variance, Var(Y) = Var(E[Y|X]) + E[Var(Y|X)], and the fraction of variance explained is Var(E[Y|X]) / Var(Y). Because Var(Y|X) can in general depend on X, the unexplained term E[Var(Y|X)], which is just the average of Var(Y|X) across all X, itself depends on the distribution of X.
So from my understanding, and in my practical experience using statistics, the coefficient of determination (what % of the variance of Y is explained by X) always provides “narrow” information tailored to a specific context, not general information about a platonic-ideal understanding of the relationship between X and Y. [I understand this might appear a bit defeatist relative to the goal set by the OP… personally, I have found that accepting it has been useful for avoiding a lot of unproductive arguments in real-world cases.]
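To make this concrete, here is a minimal sketch (made-up numbers, not from any real dataset) in which the relationship between X and Y is held completely fixed and only the spread of X differs between two samples; the fraction of variance explained still changes.

```python
# Same relationship Y = 2*X + noise in both runs; only the spread of X differs.
import numpy as np

rng = np.random.default_rng(0)
beta, noise_sd = 2.0, 5.0          # arbitrary, illustrative values

def variance_explained(x_sd, n=100_000):
    x = rng.normal(0.0, x_sd, n)
    y = beta * x + rng.normal(0.0, noise_sd, n)
    y_hat = beta * x               # the "true" model, known exactly in this toy setup
    return 1 - np.var(y - y_hat) / np.var(y)

print(variance_explained(x_sd=1.0))   # ~0.14: beta^2*1  / (beta^2*1  + 25)
print(variance_explained(x_sd=5.0))   # ~0.80: beta^2*25 / (beta^2*25 + 25)
```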
Heritability is a classic example. It can be useful in the context of, e.g., breeding crops under specific conditions and predicting how much selection on a trait will shift the next generation, but the heritability you estimate there overstates how much genetics contributes to variance in the wild, where the environment is much more variable. Also, since there are gene × environment interactions, Var(phenotype|genotype) estimated from a limited range of environments is not a good estimate of the average Var(phenotype|genotype) globally.
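A toy version of the same point (invented effect sizes, nothing biological about the numbers): the genotype-to-phenotype mechanism is identical in both runs, but the share of phenotypic variance attributed to genotype shrinks once the environment varies more and a G × E term starts to matter.

```python
# Phenotype = genotype + environment + G x E interaction + noise, fixed coefficients.
import numpy as np

rng = np.random.default_rng(1)

def genetic_share(env_sd, n=100_000):
    g = rng.normal(0.0, 1.0, n)                          # genetic value
    e = rng.normal(0.0, env_sd, n)                       # environment
    p = g + e + 0.5 * g * e + rng.normal(0.0, 0.5, n)    # phenotype
    slope = np.cov(g, p)[0, 1] / np.var(g)               # simple regression of p on g
    return np.var(slope * g) / np.var(p)                 # variance share attributed to genotype

print(genetic_share(env_sd=0.2))   # narrow, controlled environments: large genetic share
print(genetic_share(env_sd=2.0))   # variable "wild" environments: same mechanism, smaller share
```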
In the context of twin studies, if you think of twins raised together (as they are in most cases), the “predictors” they share that could influence whatever outcome you’re measuring include a lot more than just genes. So if you compare twins vs. non-twin siblings, you overestimate the genetic portion of the heritability of the characteristic of interest, since all your comparisons rest on a greater shared environment than two random people in the population would have.
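And a toy model of the twin point (made-up variance components; this is not the actual estimator used in twin studies): if twins share an environment as well as genes, and you book all of their excess similarity as genetic, the genetic share comes out inflated.

```python
# Outcome = genes + shared environment + unique noise; identical twins share g and c.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
var_g, var_c, var_e = 0.4, 0.3, 0.3      # genes, shared environment, unique noise

g = rng.normal(0, np.sqrt(var_g), n)
c = rng.normal(0, np.sqrt(var_c), n)
twin_a = g + c + rng.normal(0, np.sqrt(var_e), n)
twin_b = g + c + rng.normal(0, np.sqrt(var_e), n)

print(np.corrcoef(twin_a, twin_b)[0, 1])   # ~0.7: twin similarity reflects genes + shared environment
print(var_g)                               # 0.4: the actual genetic share in this toy model
```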
--
As a side note, this is a compromise that comes up a lot in study design: if you need to estimate the relationship between Y and X, it’s actually useful to have a population that varies less in factors other than X, but doing so potentially limits how far the results generalize to the broader population. If you want a broader sample so your conclusions are more likely to apply in different contexts, you may need a very large sample size, because in effect you need to estimate the conditional distribution of Y given X across all sorts of combinations of the other variables.
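A back-of-the-envelope sketch of that sample-size blow-up (the per-cell requirement is a hypothetical number, just for illustration): stratifying on k binary background factors multiplies the number of cells in which you need a usable estimate of (Y|X) by 2^k.

```python
# Number of covariate cells, and total sample, if each cell needs ~200 observations.
per_cell = 200                          # hypothetical minimum per cell
for k in (3, 5, 10):                    # number of binary background factors
    cells = 2 ** k
    print(k, cells, cells * per_cell)   # 10 factors -> 1024 cells -> ~200k observations
```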
---
Lastly, even if you were somehow able to calculate a version of “the % of a human characteristic that’s genetically explained” that were the true average across all populations, cultures, etc., you run into the problem that the underlying distributions are not fixed in time. In my view, an answer that’s contingent on the very specific distribution of cultural practices and human environments available at this moment is not a very fundamental quantity of interest; it’s more of an accidental characteristic, as I mention above.
Perhaps we need to step back and clarify what “Platonic Explained Variance” could even mean. All knowledge is contextual; it is a mistake to expect that there is a Truth to be known devoid of context. I suppose the OP meant by this phrase something like: the true, complete statistical dependence between X and Y in the sampled population, as against our estimate or approximation of that dependence based on a given limited sample or statistical model. In any case, I’d like to argue that such a distinction makes sense, whereas it does not make sense to look for a statistical relationship between X and Y that is eternally and universally true, independent of a specific population.
When we are using empirical statistics to describe the relationship between measurable variables X and Y, I think the conclusions we draw are always limited to the population we sampled. That is the essential nature of the inference. Generalization to the sampled population as a whole carries some uncertainty, which we can quantify based on the size of the sample we used and the amount of variability we observed, subject to some assumptions (e.g., about the underlying distributions, or independence of our observations).
But generalization to any other population always entails additional assumptions. If the original sample was limited in scope (e.g., a particular age, sex, geographic location, time point, or subculture), generalization outside that scope entails a new conjecture that the new group is essentially the same as the original one in every respect relevant to the claim. To the extent the original sample was broad in scope, we can, as you say, test whether such other factors detectably modify the association between X and Y, and if so, include these effects as covariates in our statistical model. As you note, this requires a lot of statistical power. Even so, whenever we generalize outside that population, we assume the new population is similar in the ways that matter, for both the main association and the modifier effects.
A statistical association can be factually, reproducibly true of a population and still be purely accidental, in which case we don’t expect it to generalize. When we generalize to a new context or group or point in time, I think we are usually relying on an (implicit or explicit) model that the observed statistical relation between X and Y is a consequence of underlying causal mechanisms. If and to the extent that we know what causal mechanisms are at play, we have a basis for predicting or checking whether the relevant conditions apply in any new context. But (1) generalization of the causal mechanism to a new condition is still subject to verification; a causal model derived in a narrow context could be incomplete, and the new condition may differ in some respect that turns out to be causally important in a way we didn’t suspect; and (2) even if the causal mechanism perfectly generalizes, we do not expect “the fraction of variance explained” to generalize universally. That value depends on a plethora of other random and causal factors that will in general differ between populations [^1].
Summing up, I think it’s a mistake to look for the ‘Platonic Variance Explained’ divorced from a specific population. But we can meaningfully ask if the statistical dependence we estimated from a finite empirical sample using a particular statistical model accurately reflects the true and complete statistical dependence between the variables in the population from which we sampled.
This account might be particular to the branches of natural science that seek mechanistic causal models and/or fundamental theories as explanations. Other fields of research or philosophic frameworks that lack or eschew causal explanation or theory may have a different epistemic account, which I’d be interested to hear about.
Yes, in trying to reuse the OP’s phrasing, maybe I wasn’t specific enough about what I meant. I wanted to highlight how the “fraction of variance explained” metric generalizes less well than other outputs from the same model.
For example, consider a case where a model of E[y] vs. x gives good out-of-sample predictions even when the distribution of x changes (e.g., because x stays within the range used to fit the model); the fraction of variance explained is nevertheless sensitive to the distribution of x. Of course, you can also have a confounder w whose shifting distribution makes y(x) less accurate out-of-sample by indirectly “breaking” the learned relationship, but the point is that w influences the fraction of variance explained even when it is not a confounder, i.e., even when it doesn’t break the validity of y(x).
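Here is a minimal sketch of that second scenario (all numbers made up): w affects y but is independent of x, so it is not a confounder; when w’s spread changes, the fitted E[y|x] stays essentially unbiased, yet the fraction of variance explained drops.

```python
# Fit y ~ x in one "population", then evaluate in another where only w's spread changed.
import numpy as np

rng = np.random.default_rng(3)

def fit_then_shift_w(w_sd_train, w_sd_test, n=100_000):
    x = rng.normal(0, 1, n)
    y = 2 * x + rng.normal(0, w_sd_train, n)         # w enters additively, independent of x
    slope, intercept = np.polyfit(x, y, 1)

    x_new = rng.normal(0, 1, n)                      # same x distribution, same y(x) relationship
    y_new = 2 * x_new + rng.normal(0, w_sd_test, n)  # only w's spread differs
    pred = slope * x_new + intercept
    r2 = 1 - np.var(y_new - pred) / np.var(y_new)
    bias = np.mean(y_new - pred)                     # ~0: the learned y(x) is still valid
    return round(r2, 2), round(bias, 2)

print(fit_then_shift_w(1.0, 1.0))   # ~(0.8, 0.0)
print(fit_then_shift_w(1.0, 4.0))   # ~(0.2, 0.0): same model, much less variance "explained"
```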
Or, for a more concrete example: maybe some nutrients (e.g., vitamin C) are not as predictive of individual health as they were in the past, because most people now get enough of them in their diet; fundamentally the relationship between those nutrients and health hasn’t changed, only the distribution has, and our model of that relationship is probably still good. This is a very simple example. Still, I think there is in general a lot of potential for misinterpretation of this metric (not necessarily on this forum, but in public discourse broadly), especially as it is sometimes called a measure of variable importance. When I read the first part of this post about teachers from Scott Alexander: https://www.lesswrong.com/posts/K9aLcuxAPyf5jGyFX/teachers-much-more-than-you-wanted-to-know , I can’t conclude from “having different teachers explains 10% of the variance in test scores” that teaching quality doesn’t have much impact on the outcome. (And in fact, as a parent I would value teaching quality, but not high variance in teaching quality within the school district. I wouldn’t want my kids’ learning of core topics to be strongly dependent on which school, or which class in that school, they attend.)
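To illustrate why I don’t read that 10% figure as “teachers barely matter”, here is a toy calculation with invented numbers: teacher effects account for about 10% of the score variance, yet the gap between a weak and a strong teacher is still large for an individual student.

```python
# Score = baseline + teacher effect + everything else (family, prior ability, luck, ...).
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

teacher_effect = rng.normal(0, 5, n)       # SD of 5 points across teachers
everything_else = rng.normal(0, 15, n)     # SD of 15 points from all other sources
score = 70 + teacher_effect + everything_else

print(np.var(teacher_effect) / np.var(score))                                  # ~0.10 of the variance
print(np.percentile(teacher_effect, 90) - np.percentile(teacher_effect, 10))   # ~12.8 points of gap
```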
Thanks, I think this is an excellent comment that gives lots of useful context.
To summarize briefly what foorforthought has already expressed: what I meant by platonic variance explained is the explained variance independent of a specific sample or statistical model. But as you rightly point out, this still depends on a lot of context, down to crucial details of the study design and the population one studies.