I chatted with Michael and Ammon. This made me somewhat more hopeful about this effort, because their plan wasn’t on the less-sensible end of what I uncertainly imagined from the post (e.g. they’re not going to just train a big very-nonlinear map from genomes to phenotypes, which by default would make the data problem worse not better).
I have lots of (somewhat layman) question marks about the plan, but it seems exciting/worth trying. If you're a skilled/smart/motivated/curious ML person seeing this, and you want to work on something really cool and/or something that could massively help the world, I hope you'll consider reaching out to Tabula.
An example of the sort of thing they’re planning on trying:
1: Train an autoregressive model on many genomes as base-pair sequences, both human and non-human. (Maybe upweight more-conserved regions, on the theory that they're conserved because they're under more pressure to be functional, and hence more important for phenotypes.)
1.5: Hope that this training run learns latent representations that make interesting/important features more explicit.
2: Train a linear or linear-ish predictor from the latent activations to some phenotype (disease, personality, IQ, etc.). (A code sketch of steps 1 and 2 follows below.)
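To make steps 1 and 2 concrete, here's a minimal sketch in PyTorch, assuming you have a per-base conservation score in [0, 1] (something phastCons-like). Everything here (the toy model, the names, the 2x cap on upweighting) is my own illustrative assumption, not Tabula's actual plan:

```python
import torch
import torch.nn as nn

VOCAB = 5  # A, C, G, T, N

class TinyGenomeLM(nn.Module):
    """Toy autoregressive model over base-pair tokens (stand-in for a real long-context model)."""
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)  # causal, left-to-right
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h), h  # next-base logits, and the latent activations

def weighted_ar_loss(model, tokens, conservation):
    """Step 1: next-base cross-entropy, upweighted at conserved positions."""
    logits, _ = model(tokens[:, :-1])
    ce = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1), reduction="none")
    w = 1.0 + conservation[:, 1:].reshape(-1)  # fully conserved bases count 2x
    return (w * ce).sum() / w.sum()

# Step 2: freeze the LM, pool its latents, fit a linear readout to the phenotype.
def phenotype_features(model, tokens):
    with torch.no_grad():
        _, h = model(tokens)
    return h.mean(dim=1)  # mean-pool latents over the sequence

probe = nn.Linear(64, 1)  # linear map from pooled latents to a phenotype
```

In practice the backbone would presumably be a long-context transformer rather than a GRU, and the step-2 probe could be fit with ridge regression on frozen features; the point is just the shape of the pipeline: weighted next-base prediction, then a linear readout from the latents.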
IDK if I expect this to work well, but it seems like it might. Some question marks:
Do we have enough data? If we’re trying to understand rare variants, we want whole genome sequences. My quick convenience sample turned up about 2 million publicly announced whole genomes. Maybe there’s more like 5 million, and lots of biobanks have been ramping up recently. But still, pretending you have access to all this data, this means you see any given region a couple million times. Not sure what to think of that; I guess it depends what/how we’re trying to learn.
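Some quick arithmetic behind the "couple million times" point, with the rough numbers from above (estimates, not a real inventory):

```python
# Back-of-envelope: total training tokens vs. effective per-locus diversity.
genomes = 5e6          # optimistic count of accessible whole genomes
genome_len = 3.1e9     # haploid human genome length in base pairs
total_tokens = genomes * genome_len
print(f"total tokens: {total_tokens:.1e}")   # ~1.6e16, huge by LLM standards

# But unlike a text corpus, this is mostly the same sequence repeated:
# any fixed locus is observed `genomes` times, and at a SNP with a 1%
# minor allele frequency only ~genomes/100 samples carry the rare allele.
print(f"rare-allele observations at a 1% SNP: ~{genomes * 0.01:.0e}")
```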
If we focus on conserved regions, we probably successfully pull attention away from regions that really don't matter. But we might also pull attention away from regions that sorta matter, but to which phenotypes of interest aren't terribly sensitive. It stands to reason that such regions or variants wouldn't be the most conserved ones. I don't think this defeats the logic, but it suggests we could maybe do better. Examples of other approaches (sketched below): upweight regions that are recognized as having the format of genes or regulatory regions; upweight regions around SNPs that have shown up in GWASes.
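One way to picture combining these signals is a per-base weight track built from conservation plus annotation plus GWAS proximity. A minimal sketch, where the inputs and all the constants are assumptions for illustration:

```python
import numpy as np

def base_weights(conservation, is_gene_or_regulatory, gwas_hits, length,
                 gwas_window=50_000):
    """conservation: float array in [0,1]; is_gene_or_regulatory: bool array;
    gwas_hits: list of SNP positions; both arrays have len `length`."""
    w = 1.0 + conservation                      # baseline: conservation upweighting
    w[is_gene_or_regulatory] += 0.5             # annotated genes / regulatory regions
    near_gwas = np.zeros(length, dtype=bool)
    for pos in gwas_hits:                       # windows around known GWAS SNPs
        lo, hi = max(0, pos - gwas_window), min(length, pos + gwas_window)
        near_gwas[lo:hi] = True
    w[near_gwas] += 0.5
    return w                                    # feed into the weighted loss above
```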
For the exact setup described above, with autoregression on raw genomes, do we really learn much about variants? I guess what we ought to learn something about is approximate haplotypes, i.e. linkage disequilibrium structure. The predictor should be like "ok, right now I'm in such-and-such O(10kb) gene regulatory module, and I'm looking at haplotype number 12 for this region, so by default I should expect the following variants to be from that haplotype, unless I see a couple of variants that rarely co-occur with H12 but all come from H5 or something". I don't see how this would help identify causal SNPs out of haplotypes, but it could very well make the linear regression problem significantly easier? Not sure.
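For concreteness, the "LD structure" such a model would be absorbing is just how strongly alleles at nearby sites co-occur across the population, e.g. the standard r² statistic. A toy computation (synthetic data, just to show what's being learned):

```python
import numpy as np

def ld_r2(a, b):
    """a, b: binary arrays giving the allele at site A / site B per haplotype."""
    pa, pb = a.mean(), b.mean()
    d = (a * b).mean() - pa * pb          # D = P(AB) - P(A)P(B)
    return d**2 / (pa * (1 - pa) * pb * (1 - pb))

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 1000)
b = np.where(rng.random(1000) < 0.9, a, 1 - a)  # site B mostly tracks site A
print(ld_r2(a, b))  # high r^2: seeing one allele largely predicts the other
```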
But also I wouldn't expect this exact setup to tell us much of interest about anything long-range, e.g. protein-protein interactions. On the other hand, possibly there'd be some shared representation between DNA sequence X itself and a gene coding for a protein whose zinc fingers match up with X? IDK what to expect. This could maybe be enhanced with models of protein interactions, transcription factor binding affinities, activity of regulatory regions, etc.
More generally, I’m excited about someone making a concerted and sane effort to try putting biological priors to use for genomic predictions. As a random example (which may not make much sense, but to give some more flavor): Maybe one could look at AlphaFold’s predictions of protein conformation with different rare genetic variants that we’ve marked as deleterious for some trait. If the predictions are fairly similar for the different variants, we don’t conclude much—maybe this rare variant has some other benefit. But if the rare variant makes AlphaFold predict “no stable conformation”, then we take this as some evidence that the rare variant is purely deleterious, and therefore especially safe to alter to the common variant.
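A hedged sketch of that last idea, using the fact that AlphaFold writes its per-residue pLDDT confidence into the B-factor column of its output PDBs. The file names and the 20-point drop threshold are made-up assumptions:

```python
from Bio.PDB import PDBParser

def mean_plddt(pdb_path):
    """Average per-residue pLDDT from an AlphaFold-produced PDB file."""
    structure = PDBParser(QUIET=True).get_structure("model", pdb_path)
    ca = [r["CA"].get_bfactor() for r in structure.get_residues() if "CA" in r]
    return sum(ca) / len(ca)

common = mean_plddt("protein_common_variant.pdb")  # hypothetical file names
rare = mean_plddt("protein_rare_variant.pdb")
if common - rare > 20:  # large confidence drop ~ "no stable conformation"
    print("rare variant looks destabilizing; candidate for safe editing")
```

(Low pLDDT is evidence of disorder or low model confidence rather than proof of "no stable conformation", so this would be one weak signal among several, not a decision rule.)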
I’m not as concerned about your points because there are a number of projects already doing something similar and (if you believe them) succeeding at it. Here’s a paper comparing some of them: https://www.biorxiv.org/content/10.1101/2025.02.11.637758v2.full