Interesting idea! I haven’t looked at the viral composition of the sequencing data beyond checking for covid, so that’s a good place to try to improve this estimate.
It’s been more than a decade since I studied bioinformatics at university, so I don’t know how capable the algorithms are nowadays.
What are the computing costs to identify thousands of different species within the 795M reads? If you actually want to search for new viruses you might even have to match millions of species.
There are more details in the NAO preprint (https://arxiv.org/pdf/2108.02678.pdf), but the basic idea is that if something is spreading through a population it will grow approximately exponentially in the early stages, and will ideally leave something visible in the form of exponentially increasing genetic sequences in wastewater. If we sequence deeply enough, we might be able to identify these based only on this growth pattern.
The current plan is not “map every read back to something known” but “look for things that are growing exponentially”, which is computationally much cheaper. Something like “count how often every 40-mer occurs, then run statistical analyses to identify ones that are growing exponentially, then look at those more closely”. This is much cheaper than the sequencing, though that’s a bit of a low bar since the amount of sequencing in the paper would cost maybe $10k in a single run (and likely cost them much more, since they used many smaller runs).
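To make the "count k-mers, then look for exponential growth" idea concrete, here is a toy sketch in Python. It is not the NAO pipeline: the function names (`count_kmers`, `log_linear_slope`, `growing_kmers`), the pseudocount, and the `min_slope` threshold are all assumptions made up for illustration. The idea is just that exponential growth shows up as a straight line in log space, so a least-squares slope on log relative abundance is one cheap way to flag candidates.

```python
import math
from collections import Counter

K = 40  # k-mer length from the comment above


def count_kmers(reads, k=K):
    """Count every length-k substring across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def log_linear_slope(values):
    """Least-squares slope of log(values) against sample index 0..n-1.

    Needs at least two values, all positive.
    """
    n = len(values)
    ys = [math.log(v) for v in values]
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    return sxy / sxx


def growing_kmers(samples, k=K, min_slope=0.3, pseudocount=1.0):
    """Return {kmer: slope} for k-mers whose relative abundance grows.

    samples: list of read lists, ordered by collection time.
    min_slope and pseudocount are illustrative guesses, not tuned values.
    """
    per_sample = [count_kmers(reads, k) for reads in samples]
    # Normalize by total k-mers per sample so deeper runs don't look like growth.
    totals = [max(sum(c.values()), 1) for c in per_sample]
    all_kmers = set().union(*per_sample)
    hits = {}
    for kmer in all_kmers:
        # The pseudocount keeps log() defined when a k-mer is absent from a sample.
        freqs = [(c[kmer] + pseudocount) / t for c, t in zip(per_sample, totals)]
        slope = log_linear_slope(freqs)
        if slope >= min_slope:
            hits[kmer] = slope
    return hits
```

A real version would need a memory-efficient k-mer counter and a proper statistical test rather than a fixed slope cutoff, and would probably treat a k-mer and its reverse complement as the same thing. The point is only that this is a counting pass plus a cheap per-k-mer fit, with no alignment against a reference database.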