A Back-Of-The-Envelope Calculation On How Unlikely The Circumstantial Evidence Around Covid-19 Is

Follow-up to: https://www.lesswrong.com/posts/Rof3ctjMMWxpaRj5Z/the-math-of-suspicious-coincidences

This post is a very preliminary model of the circumstantial evidence around covid-19's origin, and of how unlikely we are to actually see a world like this one under the Grand Null Hypothesis: that there has been no foul play and covid-19 is a completely natural virus that just happened to appear.

nature.com/search?q=%22risk%22%20dangerous%20%22virus%22%20pandemic&date_range=1845-2019&order=date_desc&title=risky%20research


What do we actually need to explain?

  1. Coincidence of location: Wuhan is a particularly special place in China for studying viruses like covid-19. Before 2020 the WIV group was the most important and most highly cited group in this area and, as far as I know, the only group doing GoF on bat sarbecoronaviruses. Wuhan holds about 0.5% of China's population. It's a suspicious coincidence that a viral pandemic would start in the same city as the most prominent group studying that kind of virus.

  2. Coincidence of timing: several things happened that presaged the emergence of covid-19. In December 2017, the US government lifted a ban on risky pathogen research, and in mid-2018 the Ecohealth group started planning how to make covid in the DEFUSE proposal. A natural spillover event could have happened at any time over the last, say, 40 years or (probably) the next 40 years, though likely not much before that due to changing patterns of movement (I need help on exactly how wide this time interval is).

  3. Warnings turning out to be accurate: warnings were given in Nature that specifically mentioned the WIV/Zhengli Shi group and no other group working on coronaviruses, and (in other articles) only a few other groups working on any viruses at all. I think there were hundreds of groups that could have been warned about, but this article gives 59 as the number of BSL-4 labs around the world. This is a subtler point than those above, because getting a warning is extra evidence for the lab leak hypothesis even conditional on the timing and location coincidences. Warnings were also given about WIV itself, independent of the connection to coronaviruses.

  4. Specific Features of covid-19 are a close match for what was planned in the DEFUSE proposal: This gets a lot more technical, but you can imagine a world in which labs randomly generate GoF proposals like DEFUSE and nature randomly generates viruses via natural evolution. Even in cases where you get a location coincidence as in (1), the average GoF proposal might not match a randomly paired up natural virus as well as covid-19 matches DEFUSE. This is very hard for me to assess, but US Right to Know has a summary.

It's hard for me to objectively ballpark (4) without help from a few experts. But (1), (2), and (3) are fairly easy to get a ballpark figure for. I think these three are all pretty much independent, so the overall probability of these three things happening under the null hypothesis is just the product: 1/200 × 2/80 × 3/59 ≈ 1/157,000.
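As a quick check on the arithmetic, here is a minimal sketch that reads the product above as 1/200 × 2/80 × 3/59. The variable names are mine, and the reading of the second and third factors (a roughly 2-year window in an 80-year span, and about 3 warned-about groups out of 59 BSL-4 labs) is my assumption about where those ballparks come from, not something spelled out in the post.

```python
from fractions import Fraction

# Ballpark factors for points (1) to (3) under the null hypothesis.
p_location = Fraction(1, 200)  # Wuhan is about 0.5% of China's population
p_timing = Fraction(2, 80)     # assumed reading: ~2-year window in an ~80-year span
p_warning = Fraction(3, 59)    # assumed reading: ~3 warned-about groups out of 59 labs


def joint_probability(*factors):
    """Multiply the factors together; only valid if they are independent."""
    result = Fraction(1)
    for factor in factors:
        result *= factor
    return result


p_circumstantial = joint_probability(p_location, p_timing, p_warning)
print(f"about 1 in {round(1 / p_circumstantial):,}")  # about 1 in 157,333
```

Exact fractions are used so that no rounding creeps in before the final step. The independence assumption does all the work here: if the three factors are correlated, the product understates the probability.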

If we take the information from US Right to Know at face value, then the evidence that it adds is about 1/300 or so for the furin cleavage site positioned in the spike protein at the S1/S2 junction, about 1/1000 for the BsmBI anomalies and BsmBI being found in DEFUSE, and some unknown amount for early infectiousness (ballpark 1/100), which multiplies out to 1 in 30 million. I think we can round this down to 1 in 500 or so, as that is the minimum defensible chance that the people doing these analyses are completely wrong, or that some other wacky thing has happened in the technical details.

Multiplying all this together gets you to a 1 in 80 million chance of all this stuff happening under the null hypothesis, which is highly significant.
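The two steps above (multiplying the technical factors, then capping them at 1 in 500) can be sketched as follows. This is only an illustration of the arithmetic, reading the quoted figures as 1/200, 2/80, 3/59 for points (1) to (3) and 1/300, 1/1000, 1/100 for the technical details; the names `analyst_error_floor` and `p_null` are mine.

```python
from fractions import Fraction

# Circumstantial factors (1) to (3), as ballparked in the post.
p_circumstantial = Fraction(1, 200) * Fraction(2, 80) * Fraction(3, 59)

# Technical factors for point (4), taken at face value.
p_furin_site = Fraction(1, 300)
p_bsmbi = Fraction(1, 1000)
p_early_infectiousness = Fraction(1, 100)
p_technical_raw = p_furin_site * p_bsmbi * p_early_infectiousness

# Floor on the technical factor: the chance that the analyses are simply wrong.
analyst_error_floor = Fraction(1, 500)
p_technical = max(p_technical_raw, analyst_error_floor)

p_null = p_circumstantial * p_technical
print(f"technical, face value: 1 in {round(1 / p_technical_raw):,}")
print(f"technical, capped:     1 in {round(1 / p_technical):,}")
print(f"all evidence, null:    1 in {round(1 / p_null):,}")
```

The capped total comes out a little under 1 in 79 million, which is where the rounded "1 in 80 million" comes from.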

Of course this is just a ballpark figure. I think you could easily make it more extreme by having more confidence in the technical details of point (4), or less extreme by chipping away at (1) to (3) using more specific models of the timing and distribution of natural spillovers, and perhaps by finding many more warnings that I couldn't find.

EDIT:

I think that there is also a lack of a plausible animal host for covid-19 (I mean, other than the humanized mice at WIV), though I am somewhat unsure how suspicious this is. In previous pandemics that involved spillovers, the animal host was typically identified quite fast.

EDIT, again:

Adding to the case for a lab leak is the fact that the prior isn't that low: there have been a lot of biosafety incidents, including (likely) the 1977 H1N1 flu, the foot and mouth outbreak in the UK in 2007, and a confirmed lab leak of covid-19 in Taipei.

We can also note that famous people have made bets that there would be a bio-safety incident by late 2020, so the prior cannot be that low.

As for the probability of the evidence under the alternate hypothesis, a lab leak easily explains the coincidence in space and time. Even conditional on that, I think a warning about the place where the pandemic started is much likelier if it was a leak: the chain of causation is clear, in that warners notice a specific risk, warn about it, and the risk manifests as predicted. Given that Rees was able to predict the pandemic on general principle, it seems reasonable to assign 50% probability to P(Accurate Warnings|Lab Leak). The technical material such as BsmBI follows a similar pattern, especially since BsmBI is mentioned in DEFUSE. So I think under the alternate hypothesis we are looking at something like P(Evidence|Lab Leak) = 0.5⁴ ≈ 6%, or maybe more conservatively 0.2⁴ ≈ 0.16%.

You can go through Bayes' rule, but 80 million is large enough to completely swamp the prior. So either something is wrong with this whole exercise, or the lab leak is basically certain.
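For anyone who does want to go through Bayes' rule, here is a minimal sketch in odds form. The prior of 1 in 1,000 is a purely hypothetical placeholder chosen to show how the update behaves, not a figure from this post; the likelihoods are the ballparks above (0.5⁴ and 0.2⁴ under a lab leak, 1 in 80 million under the null).

```python
def posterior_probability(prior, p_evidence_given_leak, p_evidence_given_null):
    """Posterior P(lab leak | evidence) via the odds form of Bayes' rule."""
    prior_odds = prior / (1 - prior)
    likelihood_ratio = p_evidence_given_leak / p_evidence_given_null
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


hypothetical_prior = 0.001       # ASSUMPTION: illustrative only, not from the post
p_evidence_given_null = 1 / 80e6  # ballpark from the post

for label, p_evidence_given_leak in [("0.5^4", 0.5**4), ("0.2^4", 0.2**4)]:
    posterior = posterior_probability(
        hypothetical_prior, p_evidence_given_leak, p_evidence_given_null
    )
    print(f"P(E|LL) = {label}: posterior = {posterior:.4f}")
```

Under these inputs the posterior stays above 99% for either likelihood, so the conclusion is driven almost entirely by the 1 in 80 million figure. Any challenge to the argument has to go after that number or the independence assumptions behind it.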

EDIT: Another thing I thought of is that independently of time, location and prior warnings, the mere fact that covid-19 was so wildly successful as a disease is evidence of a lab leak of a GoF virus, since GoF viruses are deliberately made to be more harmful and more transmissible. But it may be a bit hard to quantify this. I think there’s probably a factor of 10 for LL here though.

EDIT: a more specific calculation for the timing coincidence under the alternative hypothesis follows. Let's think of pandemics as a biased coin with probability p of coming up heads each year, where heads = you get a global pandemic. Pre-GoF (i.e. pre-2011) the coin had a roughly p=0.01 chance of coming up heads. From 2018 onwards the coin is replaced with a biased coin with a larger chance of heads. We flip once in 2018 (tails), then once in 2019 (heads). We need a reasonable prior for the new coin to do the calculation.
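To make the coin model concrete, here is a small sketch of the likelihood of the observed sequence (tails in 2018, heads in 2019) for a given p. The candidate post-2018 values of p are hypothetical placeholders, used only to show how the comparison against the pre-GoF coin works.

```python
def sequence_likelihood(p_heads):
    """P(tails in 2018, then heads in 2019) for a coin with P(heads) = p_heads."""
    return (1 - p_heads) * p_heads


p_pre_gof = 0.01                       # pre-GoF annual chance, from the post
candidate_post_gof = [0.05, 0.1, 0.2]  # ASSUMPTION: illustrative values only

for p in candidate_post_gof:
    ratio = sequence_likelihood(p) / sequence_likelihood(p_pre_gof)
    print(f"p = {p}: likelihood ratio vs pre-GoF coin = {ratio:.1f}")
```

The tails in 2018 costs the higher-p coin a little, but for p well below 0.5 the heads in 2019 more than makes up for it.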

We can use the fact that Rees and Pinker made a bet about this. Let's say they are jointly assigning a 50% probability per 4-year period to a million+ casualty event. But presumably Pinker believed something like the Null Hypothesis, so his probability would be much less than 50% (perhaps 4 or 5%), and so we can model Rees as thinking the probability is much higher, say 80%. We can split Rees' 80% into bioterror and bioerror. Let's allocate 65% per 4 years to bioerror (error > terror, since there are examples of the former but not the latter); that's 23% per year. According to the Wikipedia list of plagues, counting only plagues which killed a higher proportion of the global population than covid-19, we get one every 150 years. 150 years is longer than the 80 or so years in which this could have happened in China, but since the Pinker/Rees bet is global I think we can stick with global figures. This would give us a likelihood ratio of about 0.23/0.0067 ≈ 34.
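The conversion from a 4-year probability to an annual one, and the resulting likelihood ratio, can be checked with a short sketch. It assumes a constant, independent chance each year, which is my reading of how the 23% figure was obtained; the function name is mine.

```python
def annual_probability(p_period, years):
    """Annual probability implied by p_period over `years` independent years."""
    return 1 - (1 - p_period) ** (1 / years)


p_bioerror_annual = annual_probability(0.65, 4)  # 65% per 4 years, from the post
p_natural_annual = 1 / 150                       # one comparable plague per 150 years

likelihood_ratio = p_bioerror_annual / p_natural_annual
print(f"annual bioerror probability: {p_bioerror_annual:.3f}")
print(f"likelihood ratio: {likelihood_ratio:.1f}")
```

Simply dividing 65% by 4 would give about 16% per year instead, so the compounding assumption matters somewhat, though it does not change the order of magnitude of the ratio.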

Another way to look at this is to note that 2017 is the only year that anyone in history has ever bet publicly on a bioerror or bioterror attack, combined with the only year that the US has greenlit dangerous GoF work, combined with the availability of the technology to do this kind of thing.