This was perhaps an understandable viewpoint to hold in June when the best publicly available IQ predictor from the EA4 study only correlated with actual IQ at .3 in the general population (and less within family, which is what matters for embryo selection).
I happened to have spoken with some of the people from Herasight at the time and knew they had a predictor that performed quite a bit better than what was publicly available, which is where my optimism was coming from.
In October they finally published their validation white paper so now I can point to something other than private conversations to show you really can get as big of a boost as claimed.
Some people are still skeptical. Sasha Gusev, for example, has claimed that Herasight applied a “fudge factor” to get to 20% of variance explained by adjusting for the noisiness of the UKBB and ABCD cohorts. The basis for this claim is that their raw predictor explained 13.7% of the within-family variance, and they applied an “adjustment factor” to that figure because the test they validated against has a test-retest correlation of only .61.
I don’t find the critique all that convincing, though my knowledge of the reliability of different psychometric methods is still pretty limited, so take my opinion with a grain of salt. It’s well known that UKBB’s fluid intelligence test is pretty noisy, and the method they used to correct for that (disattenuation) is pretty bog-standard.
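For concreteness, here's what the standard disattenuation arithmetic looks like using the figures quoted above. I don't know exactly which reliability estimates Herasight plugged in, so treat this as a sketch of the method rather than a reproduction of their calculation; note that the naive version lands slightly above the 20% figure, which suggests they corrected for only part of the unreliability.

```python
# Hedged sketch of Spearman's disattenuation correction, applied to the
# numbers quoted in the critique. The reliabilities here come from the
# text above, not from Herasight's actual white paper calculation.

raw_r2 = 0.137       # within-family variance explained by the raw predictor
test_retest = 0.61   # reported test-retest correlation of the UKBB fluid IQ test

# Disattenuation divides the observed correlation by sqrt(reliability),
# so the variance explained (r squared) is divided by the reliability itself.
adjusted_r2 = raw_r2 / test_retest
print(f"adjusted variance explained: {adjusted_r2:.3f}")
```

Running this gives roughly 0.22, in the same ballpark as the claimed 20%, so the adjustment is at least the right order of magnitude for a test this noisy.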
They also published a follow-up in which they used another method, latent variable modeling, which produced similar results.
All that being said, it would be better if there were third-party benchmarks like we have in the AI field to evaluate the relative strength of all these different predictors.
I think it’s probably about time to create or fund an org to do this kind of thing. We need something like METR or MLPerf for genetic predictors. No such benchmarks exist right now.
This is actually a real problem. No dataset exists right now that we can guarantee hasn’t been used in the training of these models. And while I basically believe that most of these companies have done their evaluations honestly (with the possible exception of Nucleus), relying on self-reported performance numbers from companies with an economic incentive to exaggerate or cheat is not ideal.
I think you could actually start out with an incredibly small dataset. Even just 100 samples would be enough to build a binary “bullshit” or “plausible” validation set for continuous-trait predictors like height or IQ.
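To show why 100 samples is enough for a coarse check, here's a minimal sketch: with n held-out samples, the Fisher z-transform gives an approximate confidence interval on the observed predictor-to-phenotype correlation, which you can compare against the claimed figure. The specific numbers below are hypothetical, purely for illustration.

```python
import math

def fisher_ci(r_obs, n, z_crit=1.96):
    """Approximate 95% CI for a correlation via the Fisher z-transform."""
    z = math.atanh(r_obs)
    se = 1.0 / math.sqrt(n - 3)  # standard error in z-space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical scenario: a company claims its predictor explains 20% of
# variance (r ~ 0.45), but on 100 held-out samples we observe r = 0.15.
claimed_r = math.sqrt(0.20)
lo, hi = fisher_ci(0.15, 100)
verdict = "plausible" if lo <= claimed_r <= hi else "bullshit"
print(f"95% CI for observed r: ({lo:.2f}, {hi:.2f}) -> {verdict}")
```

With n=100 the interval is wide (roughly ±0.2 around the observed value), so you couldn't rank two predictors of similar quality, but you can easily catch a claim that's off by a factor of two in variance explained, which is exactly the binary check proposed above.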