an open-source repo for embryo selection
I recently made this great repo for polygenic prediction and embryo selection which I want to share with people. I’ve wanted something like this for almost a decade, and it’s so easy now that we have these superhuman coding models.
Note that I also have this longer technical essay attached to the repo, as well as these slides (I think they’re both very nice!)
Let’s look at how everything works now.
data
My repo pulls in data for existing predictors from the pgs (polygenic score) catalog, and filters to the best weights for each feature using claude’s best judgment (this worked better than using simpler heuristics like recency and dataset size). There are predictors for intelligence, height, and many disease traits. Across adults these correlate with measured phenotype at around 0.3, 0.65, and 0.15-0.3 respectively, after accounting for obvious confounders like sex and age, so pretty nontrivial.
In addition to uploading those final prediction weights, researchers will also upload per-snp (single-nucleotide polymorphism) correlations for each trait. Remarkably, those open-source gwas (genome-wide association study) sumstats are sufficient to rederive state-of-the-art predictors. The field has rallied around developing techniques like lassosum or LDpred or SBayesRC for learning pgs weights, each of which assumes that all you have access to is these gwas sumstats, along with population-level linkage-disequilibrium matrices encoding how frequently neighboring snp’s occur together compared to chance. It’s actually kind of a blessing, since you can go further and aggregate open-source gwas sumstats across biobanks, without ever getting formal portal access.
I made a slightly improved intelligence predictor by combining the ~250k intelligence datapoints with the ~750k public education datapoints using this technique called MTAG, which is roughly equivalent to taking a linear combination of the intelligence and education predictors, with the weighting chosen to maximize intelligence prediction. You can get even fancier, especially by doing more feature engineering using traits like reaction time, visual pairs matching etc., but I wanted to keep things relatively simple given that there isn’t enough public data to distinguish different IQ predictors.
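To make the "linear combination" intuition concrete, here is a minimal sketch of the closed-form least-squares weights for combining two standardized scores. This is not MTAG itself (which works on gwas sumstats) and not the repo's code; the function name and the input correlations are hypothetical, chosen only for illustration.

```python
import math


def combine_two_predictors(r1: float, r2: float, c: float) -> tuple[float, float, float]:
    """Best linear combination of two standardized scores for one target.

    r1, r2: correlation of each score with the target trait.
    c: correlation between the two scores.
    Returns (w1, w2, combined_r): the least-squares weights and the
    correlation of w1*s1 + w2*s2 with the target.
    """
    if not -1.0 < c < 1.0:
        raise ValueError("scores must not be perfectly correlated")
    denom = 1.0 - c * c
    w1 = (r1 - c * r2) / denom
    w2 = (r2 - c * r1) / denom
    # For least-squares weights, variance explained equals w1*r1 + w2*r2.
    combined_r = math.sqrt(w1 * r1 + w2 * r2)
    return w1, w2, combined_r


# Hypothetical inputs, not measured values from any dataset.
w_iq, w_edu, r_combined = combine_two_predictors(r1=0.25, r2=0.20, c=0.50)
print(f"weights: {w_iq:.2f}, {w_edu:.2f}; combined r: {r_combined:.3f}")
```

The combined correlation is never lower than the better of the two inputs, and the gain shrinks as the two scores become more correlated with each other.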
Speaking of which, I did some basic “it doesn’t look like things are horribly broken” validation where I tried to predict people’s heights based on their OpenSNP data. For the 500 people I have heights for, I got an r of 0.3, which is right on target given that the snp arrays only partially overlapped with the pgs weights, and I was using vanilla mean imputation for the missing snps. I similarly got an r of 0.3 for SAT score for the 100 people who entered that info. You can do “pseudovalidation” again using sumstats as further validation, but I would only think of that as a rough sanity check.
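For readers unfamiliar with mean imputation, here is a minimal sketch of scoring one person when their array covers only some of the weighted snps. It is an illustration rather than the repo's implementation; the function name, the dictionary layout, and the snp ids are all hypothetical.

```python
def polygenic_score(genotype: dict[str, int],
                    weights: dict[str, float],
                    allele_freq: dict[str, float]) -> float:
    """Weighted sum of effect-allele dosages, mean-imputing missing snps.

    genotype: snp id -> effect-allele count (0, 1, or 2) for snps on the array.
    weights: snp id -> per-allele effect weight from the pgs file.
    allele_freq: snp id -> population frequency of the effect allele.
    """
    score = 0.0
    for snp, weight in weights.items():
        if snp in genotype:
            dosage = float(genotype[snp])
        else:
            # Mean imputation: the expected dosage is 2 * effect-allele frequency.
            dosage = 2.0 * allele_freq[snp]
        score += weight * dosage
    return score


# Hypothetical example: snp_b is missing from this person's array.
person = {"snp_a": 2}
pgs_weights = {"snp_a": 0.5, "snp_b": -0.2}
freqs = {"snp_a": 0.3, "snp_b": 0.25}
print(polygenic_score(person, pgs_weights, freqs))
```

A mean-imputed snp contributes the same amount for everyone, so it adds nothing to the ranking between people. That is why partial overlap with the pgs weights lowers the achievable correlation.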
You can run your own 23andMe data through my repo’s predictors, although it’s kind of navel gazey imo. You already know how tall/smart/handsome you are...
embryo selection
The main value of polygenic scoring in my opinion is for prospective parents going through ivf who want to select which embryo to implant based on a polygenic trait, like intelligence or various disease risks. You get a very good predictor for any given embryo based on the parents’ phenotypes, but then all the signal between embryos comes from polygenic scoring.
Given polygenic scores, my repo converts them to qaly’s (quality-adjusted life-years), and then produces a table like this one:
Note that there are several arbitrary choices here, especially around time discounting. By default these qaly numbers are computed with no time discounting, and their sizes would shrink if you used a 3% per year discount or something.
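To see how much discounting matters, here is a small sketch of standard exponential discounting applied to qalys gained at different ages. It assumes the discount is taken back to birth, which is my assumption and not necessarily what the repo does; the function name and the example gain are hypothetical.

```python
def discounted_value(qalys_by_age: dict[int, float], annual_rate: float) -> float:
    """Present value at birth of qalys gained at each age.

    qalys_by_age: age in years -> qalys gained at that age.
    annual_rate: discount rate per year, e.g. 0.03 for 3%.
    """
    return sum(q / (1.0 + annual_rate) ** age for age, q in qalys_by_age.items())


# Hypothetical gain: one extra healthy year at age 70.
late_gain = {70: 1.0}
print(discounted_value(late_gain, 0.0))   # no discounting
print(discounted_value(late_gain, 0.03))  # 3% per year
```

Because most disease burden falls late in life, even a modest annual rate shrinks disease-related qalys much more than it shrinks gains that arrive early.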
This also represents a somewhat optimistic scenario — best of five healthy embryos, with a white couple where the statistics are more accurate. That being said, I am accounting for the within-family attenuation that comes from factors like genetic nurture and assortative mating (which cuts the variance explained by a factor of 0.5 for IQ and 0.4 for years of education!)
Also a minor issue is that I’m not using parental phenotypes to get the mean for disease risks, which would better estimate the marginal gains from pushing that risk down.
imputation
Arguably the most subtle part of this project is centered around imputation, i.e. filling in the most likely values for snp’s that are missing.
For snp data from 23andMe, you get an x boost from imputation using a standard tool like Beagle or the Michigan Imputation Server.
For embryos, imputation is more involved. You can’t actually sequence an embryo to high fidelity. Rather, on day five the clinic will slice off a few cells from the part of the embryo that will eventually become the placenta, which is a routine procedure at this point for detecting if the embryo has the right number of chromosomes. The data you can get from such a small number of cells is very noisy: the reads are very sparse, which is enough to tell if you have three copies of some chromosome, but not nearly enough data for a polygenic scoring model to use. The trick, then, is to get whole genome sequences from the parents (with imputed genomes from snp data nearly as good). If you know the parents’ haplotypes with certainty, then you can model the embryo’s maternal haplotype as coming from the mom’s two haplotypes + crossover, and similarly for the paternal haplotype. A straightforward hidden Markov model (HMM) then gives the most likely imputation for the embryo.
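Here is a toy sketch of that basic HMM for one parent, decoded with Viterbi. It is heavily simplified and is not the repo's model: it assumes each sparse read can be attributed to this parent's transmitted allele, uses a single constant crossover probability between adjacent sites, and the function name and default parameter values are hypothetical.

```python
import math


def impute_transmitted_haplotype(parent_haps, reads, switch_prob=0.01, error_rate=0.05):
    """Viterbi decode of which parental haplotype the embryo inherited at each site.

    parent_haps: two equal-length lists of 0/1 alleles (the parent's phased haplotypes).
    reads: list of 0, 1, or None (no coverage) giving the noisy observed allele.
    Returns the imputed allele at every site, including uncovered ones.
    """
    n = len(reads)
    if n == 0:
        return []
    log_stay, log_switch = math.log(1 - switch_prob), math.log(switch_prob)

    def log_emit(state, i):
        if reads[i] is None:
            return 0.0  # no read: the site carries no information
        match = reads[i] == parent_haps[state][i]
        return math.log(1 - error_rate if match else error_rate)

    # best[s] = log-probability of the best path ending in state s.
    best = [math.log(0.5) + log_emit(s, 0) for s in (0, 1)]
    back = []  # back[i - 1][s] = best previous state for state s at site i
    for i in range(1, n):
        new_best, pointers = [], []
        for s in (0, 1):
            stay = best[s] + log_stay
            switch = best[1 - s] + log_switch
            pointers.append(s if stay >= switch else 1 - s)
            new_best.append(max(stay, switch) + log_emit(s, i))
        best = new_best
        back.append(pointers)

    # Trace back from the best final state.
    state = 0 if best[0] >= best[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [parent_haps[s][i] for i, s in enumerate(path)]


# Hypothetical example: sparse reads with one crossover partway along.
haps = [[0] * 10, [1] * 10]
sparse_reads = [0, None, 0, None, None, None, 1, None, None, 1]
print(impute_transmitted_haplotype(haps, sparse_reads))
```

The uncovered sites get filled in from whichever parental haplotype the decoded path follows, which is how a handful of reads can stand in for a full genotype.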
In general though, there are switching errors when you try and phase someone’s genome into its haplotypes, which invalidates the basic HMM. Instead you need a hidden Markov model that tracks all embryos simultaneously, and takes advantage of how a single parental switching error is more likely than simultaneous crossover events across all the embryos. Performance here degrades a bit if you only have a couple embryos, but by less than you might think. There’s also a more brute force solution for fixing the parental phasing where you get a grandparent or other child’s genome.
closing thoughts
I think my repo is pretty cool, but at present I think the potential gains are relatively small. If you have five people and choose the person who’s the most extreme along a given trait, they’ll typically be around 1.2 standard deviations above average and fall in something like the 88th percentile. If you’re instead selecting based on a pgs with an r² of 0.1 amongst people who already share half their dna, then the top person will typically be around 0.2 standard deviations above average, landing in the 58th percentile. (Two discounts go into that smaller number: the first follows directly from theory, from only having half as much dna to select on, and the second comes from empirical results about within-family attenuation for intelligence and years of education.)
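The arithmetic behind those two figures can be sketched as follows. The expected maximum of five standard normal draws is computed by numerical integration, then scaled by the standard deviation of the score among siblings. The population variance explained (0.1), the 0.5 within-family attenuation, and the 0.5 sibling share of variance are assumptions taken from my reading of the surrounding text; the function name is hypothetical.

```python
from statistics import NormalDist


def expected_max_of_normals(n: int, steps: int = 20000, span: float = 8.0) -> float:
    """E[max of n iid standard normals], by midpoint-rule integration.

    Uses the density of the maximum: n * pdf(x) * cdf(x) ** (n - 1).
    """
    std = NormalDist()
    dx = 2 * span / steps
    total = 0.0
    for k in range(steps):
        x = -span + (k + 0.5) * dx
        total += x * n * std.pdf(x) * std.cdf(x) ** (n - 1) * dx
    return total


best_of_5 = expected_max_of_normals(5)
print(f"best of 5: {best_of_5:.2f} SD, percentile {NormalDist().cdf(best_of_5):.0%}")

# Selecting on a score instead of the trait itself (all three values assumed).
r2 = 0.1                # variance explained in the general population
attenuation = 0.5       # within-family attenuation of variance explained
sibling_fraction = 0.5  # share of additive genetic variance between siblings
gain = best_of_5 * (r2 * attenuation * sibling_fraction) ** 0.5
print(f"gain: {gain:.2f} SD, percentile {NormalDist().cdf(gain):.0%}")
```

Under these assumptions the result lands close to the rounded figures quoted above. The gain scales with the square root of variance explained, so doubling predictor quality gives only about a 40% larger gain.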
I think things will get more exciting once we have better biotech and larger datasets. You could imagine having enough selection power to really minimize the risk of twenty different diseases simultaneously, including debilitating psychiatric diseases, and to have children who are essentially arbitrarily smart. I’m not sure how much those things will matter in a post-AGI world, but they seem like part of a good future to me.
There were a few bugs that I hit while working on this: I didn’t realize that you have to subtract off the top ~10 principal components to account for non-causal correlations (otherwise you get nonsense numbers when trying to extrapolate to non-white people!); I didn’t even realize that switching errors during parental phasing were something to worry about; there are various errors you can make when trying to reconcile different data formats, e.g. different generations of 23andMe chips; and it’s hard to find the best papers from pgs catalog because the metrics that researchers report aren’t fully comparable. Please complain about any bugs that you see, and I’ll try and fix them if they’re serious.