In the post I alluded to a nice, self-contained, tricky math inequality problem that I am hoping someone will be nerd-sniped by. (I am rusty on my linear algebra inequalities, and I don’t care enough to spend more time on it.) Here’s what I wrote:
2025-01-18: I mentioned in a couple places that it might be possible to have non-additive genetic effects that are barely noticeable in rDZ-vs-½rMZ comparisons, but still sufficient to cause substantial Missing Heritability. The Zuk et al. 2012 paper and its supplementary information have some calculations relevant to this, I think? I only skimmed it. I’m not really sure about this one. If we assume that there’s no assortative mating, no shared environment effects, etc., then is there some formula (or maybe inequality) relating rDZ-vs-½rMZ to a numerical quantity of PGS Missing Heritability? I haven’t seen any such formula. This seems like a fun math problem; someone should figure it out or look it up, and tell me the answer!
More details: Basically, when rDZ is less than ½rMZ, then there has to be nonlinearity in the map from genomes to outcomes (leaving aside other possible causes). And if there’s nonlinearity, then the polygenic scores can’t be perfectly predictive. But I’m trying to relate those quantitatively.
Like, intuitively, if rMZ=1.000 and rDZ=0.499, then OK yes there’s nonlinearity, but probably not very much, so probably the polygenic score will work almost perfectly (again assuming infinite sample size etc).
…Conversely, if rMZ=1.000 and rDZ=0.001, then intuitively we would expect “extreme nonlinearity” and the polygenic scores should have very bad predictive power.
But are those always true, or are there pathological cases where they aren’t? That’s the math problem.
I tried this with reasoning LLMs a few months ago with the following prompt (not sure if I got it totally right!):
I have a linear algebra puzzle.
There’s a high-dimensional vector space G of genotypes.
There’s a probability distribution P within that space G, for the population.
There’s a function F : G → Real numbers, mapping genotypes to a phenotype.
There’s an “r” where we randomly and independently sample two points from P, call them X and Y, and find the (Pearson) correlation between F(X) and F((X+Y)/2).
If F is linear, I believe that r^2=0.5. But F is not necessarily linear.
Separately, we try to find a linear function G which approximates F as well as possible—i.e., the G that minimizes the average (F(X) - G(X))^2 for X sampled from P.
Let s^2 be the percent of variance in F explained by G, sampled over the P.
I’m looking for inequalities relating s^2 to r^2, ideally in both directions (one where r is related to an upper bound on s, the other a lower bound).
Commentary on that:
The X vs (X+Y)/2 is not exactly what happens with siblings. It’s similar to comparing a parent to their child—i.e., X is the genotype of the mother, Y the father, (X+Y)/2 the kid. But parent-child should be mathematically similar to sibling-sibling, since both are 50% relatedness. …Except it’s not really parent-child either, because if X has a SNP but Y doesn’t, then the child has the SNP with 50% probability, rather than having “half of that SNP”. But I figured it might amount to the same thing? But I do think you need to be randomizing over individual SNPs to formulate the problem for the actual sibling case we care about. (So really, the individuals are all binary / indicator vectors (entries are 1s and 0s), as opposed to arbitrary elements of the vector space. I’m just guessing that’s not too important to the problem.)
I’m assuming rMZ=1, and then “r” is rDZ, “G” is the polygenic score, and “s” quantifies the predictive power of the polygenic score.
I did this very quickly, so there might be other mistakes in this problem formulation, and I wasn’t motivated enough to keep exploring it.
(Btw, I sent that prompt to a few AIs around January 2025, and they gave answers but I don’t think the answers were right.)
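For anyone who wants to poke at the puzzle numerically, here is a minimal Monte Carlo sketch of the two quantities in the prompt. It assumes (my choice, not part of the prompt) that P is a standard normal distribution on a 20-dimensional space, and it treats the "best linear G" as an affine least-squares fit. The function name `estimate_r2_and_s2` and the two example F's are hypothetical.

```python
import numpy as np


def estimate_r2_and_s2(F, n_loci=20, n_samples=100_000, seed=0):
    """Monte Carlo estimate of r^2 and s^2 from the prompt.

    Assumption (not in the prompt): P is a standard normal on R^n_loci.
    F must map an (n_samples, n_loci) array to an (n_samples,) array.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_loci))
    Y = rng.standard_normal((n_samples, n_loci))

    # r: Pearson correlation between F(X) and F((X + Y) / 2).
    fx = F(X)
    f_mid = F((X + Y) / 2)
    r = np.corrcoef(fx, f_mid)[0, 1]

    # s^2: share of Var(F) explained by the best affine fit under P.
    design = np.column_stack([np.ones(n_samples), X])
    coef, *_ = np.linalg.lstsq(design, fx, rcond=None)
    residual = fx - design @ coef
    s2 = 1.0 - residual.var() / fx.var()
    return r**2, s2


# Example 1: a linear F (weights 1..20 are arbitrary).
linear_r2, linear_s2 = estimate_r2_and_s2(lambda X: X @ np.arange(1, 21))
# Example 2: a pure pairwise interaction between the first two coordinates.
pair_r2, pair_s2 = estimate_r2_and_s2(lambda X: X[:, 0] * X[:, 1])

print(f"linear F:      r^2 = {linear_r2:.3f}, s^2 = {linear_s2:.3f}")
print(f"interaction F: r^2 = {pair_r2:.3f}, s^2 = {pair_s2:.3f}")
```

Under the Gaussian assumption, the linear F should come out near r^2 = 0.5 and s^2 = 1, while the pure interaction should come out near r^2 = 0.25 and s^2 = 0. Swapping in other F's is a cheap way to hunt for pathological cases before trying to prove anything.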
All methodology is from the first section of the appendix of the linked paper. The paper cites pages 81-87 of Genetics and Analysis of Quantitative Traits; I read from chapter 4 up to those pages to understand the method conceptually. Every niceness assumption (for example, “no assortative mating”) is made except for “no shared environment.”
Changing some notation: $r_{MZ} = S + 1$, i.e. we normalize so that the total “variance due to genes” is 1. We assume that the variance due to shared environment is the same for identical (MZ) and fraternal (DZ) twins, $S_{MZ} = S_{DZ} = S$. This is a standard assumption in ACE, and it seems reasonable. $R^2$ will represent the “nonlinear part of the effect due to genes,” i.e. that due to epistasis and dominance. $V_A = 1 - R^2$ is the effect due to alleles, what you called $s^2$.
Facts:
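$2\left(r_{MZ} - r_{DZ} - \tfrac{1}{2}\right) < R^2 \le 4\left(r_{MZ} - r_{DZ} - \tfrac{1}{2}\right)$ always. (1)
When $S = 0$:
$2\left(\tfrac{1}{2} - \frac{r_{DZ}}{r_{MZ}}\right) < R^2 \le 4\left(\tfrac{1}{2} - \frac{r_{DZ}}{r_{MZ}}\right)$. (2)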
Note that $R^2 \le 1$, so the upper bound becomes trivial once the parenthesized quantity in either inequality reaches 0.25.
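To make the no-shared-environment case concrete, here is a small sketch that plugs the two intuition-pump examples from the question (rMZ = 1.000 with rDZ = 0.499, and rMZ = 1.000 with rDZ = 0.001) into the S = 0 inequality. The function name `r2_bounds_no_shared_env` is hypothetical, and clipping the interval to [0, 1] is my own addition.

```python
def r2_bounds_no_shared_env(r_mz, r_dz):
    """Return (lower, upper) such that lower < R^2 <= upper, assuming S = 0.

    Uses 2 * (1/2 - rDZ/rMZ) < R^2 <= 4 * (1/2 - rDZ/rMZ), then clips to
    [0, 1] because R^2 is a share of the genetic variance. A negative gap
    (rDZ above half of rMZ) means the S = 0 assumption cannot hold; the
    clipping then just returns (0.0, 0.0).
    """
    gap = 0.5 - r_dz / r_mz
    lower = min(1.0, max(0.0, 2.0 * gap))
    upper = min(1.0, max(0.0, 4.0 * gap))
    return lower, upper


for r_mz, r_dz in [(1.0, 0.499), (1.0, 0.001)]:
    lower, upper = r2_bounds_no_shared_env(r_mz, r_dz)
    print(f"rMZ={r_mz}, rDZ={r_dz}: {lower:.3f} < R^2 <= {upper:.3f}")
```

Under these assumptions the first example gives roughly 0.002 < R^2 ≤ 0.004, so a perfect polygenic score would capture between about 99.6% and 99.8% of the genetic variance. The second gives roughly 0.998 < R^2 ≤ 1, so it would capture at most about 0.2%. That matches the intuition in the question: within this model there are no pathological cases.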
Explanation:
We can decompose $r_{MZ} = S + V_A + R^2$. We can decompose this further into $R^2 = \sum_{i,j \ge 0,\ (i,j) \ne (1,0)} V_{A^i D^j}$. That is, we’re taking the nonlinear part and decomposing it into interactions involving alleles across $i$ loci and dominance effects in $j$ loci.
To understand dominance effects, note that a locus can have 0, 1, or 2 instances of an allele. The respective phenotypes resulting from these might not be produced by any linear function on alleles, because not every three points are collinear. The dominance term is the error resulting from a linear regression. If we were haploid, we wouldn’t have to deal with this.
So for example, $V_{A^5 D^3}$ refers to phenotype effects that only appear when there is a specific combination of two alleles at three separate loci, and are multilinear in the alleles occurring at five other loci.
Interpreting this in context, $r_{DZ} = S_{DZ} + \tfrac{1}{2} V_A + \sum_{i,j \ge 0,\ (i,j) \ne (1,0)} 2^{-i-2j}\, V_{A^i D^j}$.
Now we can justify our conclusions. Note that the third term is at most $\tfrac{1}{4} R^2$ but has no positive lower bound (it can be arbitrarily close to 0). When we have to write it out, we’ll call it $X$.
Remember that $R^2$ is exactly the proportion of variance due to genes that cannot be captured by a polygenic score, the “phantom heritability.” The paper is concerned with how substantial $R^2$ means $V_A < 1$, so that if the polygenic score is close to $V_A$, people will assume there is missing heritability when in reality the polygenic score is perfect and the heritability is simply nonlinear.
The ACE Estimate:
$2(r_{MZ} - r_{DZ}) = V_A + \sum_{i,j \ge 0,\ (i,j) \ne (1,0)} \left(2 - 2^{-i-2j+1}\right) V_{A^i D^j}$. This disagrees with the figure in the appendix of the paper. I believe they made an arithmetic error, but it is possible I made a conceptual error.
Recalling $V_A = 1 - R^2$, $r_{MZ} - r_{DZ} - \tfrac{1}{2} = \sum_{i,j \ge 0,\ (i,j) \ne (1,0)} \left(\tfrac{1}{2} - 2^{-i-2j}\right) V_{A^i D^j}$. Those constant terms in the sum go as low as $\tfrac{1}{4}$ and arbitrarily close to $\tfrac{1}{2}$, so by taking the bounds and dividing we recover (1).
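As a quick sanity check on the claim about those constant terms, here is a small sketch that enumerates the weight attached to each variance component for orders up to 7 in each index. It uses exact fractions. Skipping the empty (0, 0) term alongside the additive (1, 0) term is my assumption about what the sum is meant to range over, and the cutoff of 7 is arbitrary.

```python
from fractions import Fraction


def coefficient(i, j):
    """Weight (1/2 - 2^(-i-2j)) on the component with i additive and j dominance loci."""
    return Fraction(1, 2) - Fraction(1, 2 ** (i + 2 * j))


# All (i, j) up to an arbitrary cutoff, skipping the empty term (0, 0)
# (an assumption) and the purely additive term (1, 0).
orders = [
    (i, j)
    for i in range(8)
    for j in range(8)
    if (i, j) not in [(0, 0), (1, 0)]
]
coefficients = {order: coefficient(*order) for order in orders}

smallest = min(coefficients.values())
largest = max(coefficients.values())
attained_at = sorted(o for o, c in coefficients.items() if c == smallest)

print(f"smallest weight: {smallest}, attained at {attained_at}")
print(f"largest weight within cutoff: {float(largest):.7f} (always below 1/2)")
```

The smallest weight, 1/4, is attained only by pairwise additive-by-additive epistasis (2, 0) and single-locus dominance (0, 1), which is what makes the factor-of-4 upper bound tight for those two kinds of nonlinearity. Every higher-order term pushes toward the factor-of-2 lower bound.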
The Rule of Thumb:
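$\frac{r_{DZ}}{r_{MZ}} = \frac{S + \tfrac{1}{2} V_A + X}{S + V_A + R^2} = \frac{S}{S+1} + \frac{1}{2(S+1)}\left(1 - R^2\right) + \frac{1}{S+1} X$. Remembering the bounds on $X$, we can write $X = \alpha R^2$, where $\alpha \in \left(0, \tfrac{1}{4}\right]$. Combining this with the middle $-R^2$ term, we have $\frac{r_{DZ}}{r_{MZ}} = \tfrac{1}{2} + \frac{S}{2(S+1)} - \alpha' R^2$, where $\alpha' = \frac{1/2 - \alpha}{S+1} \in \left[\frac{1}{4(S+1)}, \frac{1}{2(S+1)}\right)$. Doing the arithmetic,
$R^2 \in \left(2(S+1)\left(\tfrac{1}{2} + \frac{S}{2(S+1)} - \frac{r_{DZ}}{r_{MZ}}\right),\ 4(S+1)\left(\tfrac{1}{2} + \frac{S}{2(S+1)} - \frac{r_{DZ}}{r_{MZ}}\right)\right]$. Picking $S = 0$ yields (2).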
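The interval can also be spot-checked numerically. The sketch below draws a random shared-environment term S and a random split of R^2 across a handful of nonlinear components, builds rMZ and rDZ from the decompositions above, and confirms that R^2 lands inside the interval. It reads the garbled middle term as S / (2(S+1)), which is what the algebra gives. The function name, the particular list of component orders, and the sampling ranges are all hypothetical choices.

```python
import random


def check_rule_of_thumb(trials=1000, seed=0):
    """Return True if R^2 falls in the general-S interval on every random trial."""
    rng = random.Random(seed)
    # A few nonlinear (i, j) orders; any set excluding (1, 0) would do.
    orders = [(2, 0), (0, 1), (1, 1), (3, 0), (0, 2), (4, 1)]
    for _ in range(trials):
        S = rng.uniform(0.0, 1.0)      # shared environment
        R2 = rng.uniform(0.01, 1.0)    # nonlinear share of genetic variance
        V_A = 1.0 - R2                 # additive share

        # Split R2 at random across the chosen nonlinear components.
        weights = [rng.random() for _ in orders]
        total = sum(weights)
        components = {o: R2 * w / total for o, w in zip(orders, weights)}

        # X is the nonlinear contribution to the DZ correlation.
        X = sum(v * 2.0 ** (-i - 2 * j) for (i, j), v in components.items())
        r_mz = S + V_A + R2
        r_dz = S + 0.5 * V_A + X

        gap = 0.5 + S / (2.0 * (S + 1.0)) - r_dz / r_mz
        lower = 2.0 * (S + 1.0) * gap
        upper = 4.0 * (S + 1.0) * gap
        if not (lower < R2 <= upper + 1e-12):  # small float tolerance
            return False
    return True


print(check_rule_of_thumb())
```

This only samples a finite family of models, so it is a consistency check on the arithmetic rather than a proof.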
Comments:
I don’t yet rigorously understand how $R^2$ is decomposed into epistasis and dominance. The book gives only an intuition and not a proof. It is very ad hoc.
Edit: As of yesterday, I now understand.