Model estimating the number of infected persons in the Bay Area

[Edit: I already found one error in my spreadsheet, and adjusted the incubation rate, which decreased my results by an order of magnitude. My up-to-date spreadsheet is here, but heed it at your own risk.]

[Epistemic sta­tus: Quite un­cer­tain. It seems plau­si­ble that I made a ma­jor math er­ror and this model is flat-out wrong, or that some of the in­puts I used were very off. Best to think of this as a draft.]

[Thank you to Elizabeth Garrett, Luke Raskopf, jimrandomh, and PeterH.]

In my coro­n­avirus plan­ning, the crux be­tween differ­ent ac­tions is of­ten “how many peo­ple are in­fected (as op­posed to symp­tomatic) on a given day?” (For in­stance, when 0.5% of the Bay area pop­u­la­tion is in­fected, I’m go­ing to stop go­ing to the gym.)

This post walks through the model that I’m us­ing to es­ti­mate cur­rent in­fec­tion rates. I’d be grate­ful for any­one sug­gest­ing im­prove­ments, nit­pick­ing the in­puts, and es­pe­cially cor­rect­ing er­rors.

I’m computing my estimates in this messy spreadsheet, which automatically imports data from the Johns Hopkins CSSE GitHub repo. (Thanks PeterH!)

Ba­sic argument

(This model is a vari­a­tion of one that Eliz­a­beth Gar­rett shared with me. Please give credit where credit is due.)

My goal is to estimate the number of people who are infected (who are carriers of the disease) rather than the number of people who are currently suffering symptoms. Here I’m going to walk through a series of steps, starting from the number of confirmed cases in a location, and derive an estimate of the number of infected persons in that population.

Use di­ag­no­sis rate and num­ber of con­firmed cases, to get the to­tal num­ber of symp­tomatic cases

To estimate the number of people currently infected, I start with the number of new cases that were diagnosed in the past doubling period.

But not all the peo­ple who de­vel­oped symp­toms are con­firmed as hav­ing the dis­ease. Pre­sum­ably some frac­tion (less than one) of all peo­ple who de­velop symp­toms are suc­cess­fully di­ag­nosed. But if you know what that frac­tion (the di­ag­no­sis rate or con­fir­ma­tion rate) is, you can get the to­tal num­ber of cases by mul­ti­ply­ing the con­firmed num­ber of cases by one over the di­ag­no­sis rate.

to­tal cases that be­came symp­tomatic in the past dou­bling pe­riod = cases con­firmed in the most re­cent dou­bling pe­riod * 1 /​ con­fir­ma­tion rate
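As a minimal sketch of this step in Python (the function name and the example numbers are illustrative assumptions, not values from the spreadsheet):

```python
def symptomatic_cases(confirmed_cases, confirmation_rate):
    """Scale confirmed cases up to an estimate of all symptomatic cases.

    confirmation_rate is the fraction of symptomatic people who get
    diagnosed, so it must be greater than 0 and at most 1.
    """
    if not 0 < confirmation_rate <= 1:
        raise ValueError("confirmation_rate must be in (0, 1]")
    return confirmed_cases / confirmation_rate


# Hypothetical example: 20 confirmed cases, 25% of symptomatic cases diagnosed.
print(symptomatic_cases(20, 0.25))  # 80.0
```

The lower the assumed confirmation rate, the larger the multiplier, which is why this input matters so much later on.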

Use dou­bling time and re­cent daily cases, to get the num­ber of cases one dou­bling time ago

If you know the dou­bling time of the dis­ease, and you know how many new cases there were in the past one dou­bling time, you know how many cases there were at the be­gin­ning of that dou­bling time.

For instance, if you know that a disease has a doubling time of one week, and you know that there were 50 new cases over the past week, that means there must have been 50 cases in total a week ago. (That’s what a doubling time means: after one doubling time, there are twice as many cases as you started with.)

to­tal cases that be­came symp­tomatic in the past dou­bling pe­riod = to­tal cases that had already shown symp­toms at the be­gin­ning of that dou­bling period
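A quick way to check this identity, assuming pure exponential growth (the helper name and the numbers here are hypothetical):

```python
def cumulative_cases(initial_cases, days_elapsed, doubling_time_days):
    """Cumulative cases after days_elapsed, assuming exponential growth."""
    return initial_cases * 2 ** (days_elapsed / doubling_time_days)


# Start with 50 cumulative cases and a doubling time of 7 days.
start = 50
end = cumulative_cases(start, days_elapsed=7, doubling_time_days=7)

# New cases during one doubling period equal the count at its start.
print(end - start)  # 50.0
```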

Use to­tal num­ber of cases and in­cu­ba­tion pe­riod, to get the num­ber of peo­ple who be­came in­fected one in­cu­ba­tion pe­riod ago

However, the number of symptomatic cases lags behind the number of infected people, because there’s an incubation period.

If we treat the in­cu­ba­tion pe­riod as uniform, that means that the to­tal num­ber of peo­ple that have shown symp­toms, on any given day, is equal to the num­ber of peo­ple who were in­fected one in­cu­ba­tion pe­riod ago.

So we’re now estimating the number of people that were infected two steps in the past: a doubling time and an incubation period ago.

Use the num­ber of in­fected peo­ple (one dou­bling time + one in­cu­ba­tion pe­riod) ago and the dou­bling time, to get the cur­rent num­ber of in­fected people

Once you have a num­ber of peo­ple who were in­fected (though not nec­es­sar­ily symp­tomatic) a dou­bling time and an in­cu­ba­tion pe­riod ago, you can mul­ti­ply that num­ber by 2 raised to “how­ever many dou­bling times there have been since that day”.

This gives us an es­ti­mate of the num­ber of peo­ple who are cur­rently in­fected.
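Putting the four steps together, here is a rough Python sketch of the whole model. The function and parameter names are my own shorthand for illustration, and the example inputs are made up rather than taken from the spreadsheet.

```python
def estimate_current_infected(new_confirmed, confirmation_rate,
                              doubling_time_days, incubation_days):
    """Estimate current infections from recently confirmed cases.

    new_confirmed: cases confirmed during the most recent doubling period.
    """
    # Step 1: scale confirmed cases up to all newly symptomatic cases.
    new_symptomatic = new_confirmed / confirmation_rate
    # Step 2: new cases in one doubling period equal the cumulative
    # symptomatic cases at the start of that period.
    symptomatic_then = new_symptomatic
    # Step 3: with a uniform incubation period, that also equals the
    # cumulative infections one incubation period earlier.
    infected_then = symptomatic_then
    # Step 4: grow that number forward to today.
    days_since = doubling_time_days + incubation_days
    doublings = days_since / doubling_time_days
    return infected_then * 2 ** doublings


# Hypothetical inputs: 10 newly confirmed cases, 50% confirmation rate,
# 7-day doubling time, 14-day incubation period.
print(estimate_current_infected(10, 0.5, 7, 14))  # 160.0
```

In this example there are three doubling times between the reference day and today (one doubling time plus two more for the incubation period), so the 20 estimated symptomatic cases are multiplied by 8.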

(If you see any er­rors, please leave a com­ment!)

Con­clu­sion with cur­rent numbers

Given the above model, we can plug in some available num­bers to get an es­ti­mate of how many peo­ple in the Bay area are cur­rently (as of the evening of March 8, 2020) in­fected with COVID-19.

For number of confirmed cases, I’m using the data from Johns Hopkins CSSE. [See the “intermediate calculations” tab of the spreadsheet].

(Note that these numbers include the Grand Princess cruise ship, which is currently in the Pacific off the coast of California.)

I’ve heard that the dou­bling time for COVID-19 is be­tween 3.5 and 7 days, so I calcu­lated both of those, for a rough lower and up­per bound. As more data comes in, I’ll be able to ob­serve the dou­bling time in the Bay area di­rectly, and use that for fu­ture calcu­la­tions.

(For calcu­lat­ing the num­ber of new cases in the past 3.5 days, I took the differ­ence be­tween to­day and the av­er­age of 3 and 4 days ago.)
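In code, that half-day interpolation might look like the following sketch. It assumes a list of cumulative confirmed counts with one entry per day, oldest first; the function name and the sample series are hypothetical.

```python
def new_cases_in_last_3_5_days(cumulative_by_day):
    """New cases over the past 3.5 days from daily cumulative counts.

    cumulative_by_day: cumulative confirmed counts, one per day, oldest
    first, with today's count last. Needs at least 5 entries.
    """
    if len(cumulative_by_day) < 5:
        raise ValueError("need at least 5 days of data")
    today = cumulative_by_day[-1]
    three_days_ago = cumulative_by_day[-4]
    four_days_ago = cumulative_by_day[-5]
    # Approximate the count 3.5 days ago as the midpoint of days 3 and 4.
    baseline = (three_days_ago + four_days_ago) / 2
    return today - baseline


# Made-up cumulative series for six consecutive days.
print(new_cases_in_last_3_5_days([10, 14, 20, 26, 34, 48]))  # 31.0
```

Averaging the two days is a linear interpolation, which slightly misstates the midpoint of an exponential curve, but the difference is small next to the uncertainty in the other inputs.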

I’m very un­cer­tain about what a rea­son­able con­fir­ma­tion rate is. Are 50% of symp­tomatic cases suc­cess­fully be­ing di­ag­nosed as COVID-19? Are 30%? 10%? 1%?!

I elected to take all of them, and com­pute the num­ber of peo­ple who are in­fected as a func­tion of the con­fir­ma­tion rate. [see “Num­ber of in­fected peo­ple” tab in the spread­sheet.]
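The sweep over confirmation rates can be sketched like this. The case counts below are placeholders, not the spreadsheet’s values, and the 14-day incubation period is the simplifying assumption listed later in this post.

```python
def estimate_current_infected(new_confirmed, confirmation_rate,
                              doubling_time_days, incubation_days):
    """The model's four steps collapsed into one expression."""
    doublings = (doubling_time_days + incubation_days) / doubling_time_days
    return (new_confirmed / confirmation_rate) * 2 ** doublings


def sweep(new_confirmed_by_doubling_time, confirmation_rates, incubation_days):
    """Return {(doubling_time, rate): estimate} for every combination."""
    results = {}
    for doubling_time, new_confirmed in new_confirmed_by_doubling_time.items():
        for rate in confirmation_rates:
            results[(doubling_time, rate)] = estimate_current_infected(
                new_confirmed, rate, doubling_time, incubation_days)
    return results


# Placeholder inputs: new confirmed cases in the last 3.5 and 7 days.
new_confirmed = {3.5: 40, 7: 60}
rates = [0.01, 0.05, 0.10, 0.30, 0.50, 0.70]
table = sweep(new_confirmed, rates, incubation_days=14)
for (doubling_time, rate), infected in sorted(table.items()):
    print(f"doubling={doubling_time} days, rate={rate:.0%}: {infected:,.0f}")
```

Because the estimate is proportional to one over the confirmation rate, doubling the assumed rate halves the estimate, which shows up as a straight line on a log-log plot.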

Plot­ted on a log scale:

Again, I’m un­sure what kind of di­ag­no­sis rate is rea­son­able. But, from a rough guess, I would be sur­prised if it was less than 5%, and sur­prised if it was much more than 70%.

So that gives an upper bound of 89,740 infected people (about 1.15% of the population of the Bay area) and a lower bound of 1,393 infected people (about 0.02% of the population of the Bay area).

Note that that up­per bound, in par­tic­u­lar, is very sen­si­tive to changes in the con­fir­ma­tion rate: if we as­sume that 10% of cases are suc­cess­fully di­ag­nosed, our num­ber of in­fected per­sons drops to 44,870 (~0.5% of BA pop­u­la­tion).

Not­ing some sim­plify­ing as­sump­tions that I’m mak­ing:

  • I’m as­sum­ing that the spread of coro­n­avirus is well-mod­eled by an ex­po­nen­tial func­tion.

  • I’m as­sum­ing that ev­ery­one who is in­fected be­gins dis­play­ing symp­toms ex­actly 2 weeks later.

    • (To the extent that infectees show symptoms earlier than two weeks, these models are overestimating the true values because there are fewer doubling times between infection and confirmation).

  • I’m as­sum­ing that ev­ery­one who gets coro­n­avirus is di­ag­nosed and con­firmed as hav­ing coro­n­avirus on the day they de­velop symp­toms.

    • In re­al­ity, there’s prob­a­bly a lag (does any­one know how much of a lag?), which means these num­bers will un­der­es­ti­mate the true value, be­cause we’re ac­tu­ally get­ting data about who started show­ing symp­toms a few days ago.
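To get a feel for how much the incubation and lag assumptions matter, here is a small sketch of the multiplier the model applies to the symptomatic case count. The `reporting_lag_days` parameter is my own addition for illustration, and the alternative values tried below are hypothetical rather than estimates of the real incubation period or reporting lag.

```python
def growth_multiplier(doubling_time_days, incubation_days,
                      reporting_lag_days=0):
    """Factor that turns symptomatic cases from one doubling period ago
    into an estimate of current infections."""
    days_back = doubling_time_days + incubation_days + reporting_lag_days
    return 2 ** (days_back / doubling_time_days)


# Baseline from the assumptions above: 14-day incubation, no reporting lag.
print(growth_multiplier(3.5, 14))  # 32.0

# Hypothetical shorter incubation period of 7 days: a smaller multiplier,
# so the baseline would be an overestimate.
print(growth_multiplier(3.5, 7))  # 8.0

# Hypothetical 3.5-day lag between symptoms and confirmation: a larger
# multiplier, so the baseline would be an underestimate.
print(growth_multiplier(3.5, 14, reporting_lag_days=3.5))  # 64.0
```

With a 3.5-day doubling time, every 3.5 days of error in either assumption changes the estimate by a factor of two, and the two errors push in opposite directions.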


Again, please help me cor­rect any mis­takes. Ad­di­tion­ally, if any­one has bet­ter data for any of these in­puts than I’ve used here, es­pe­cially for the con­fir­ma­tion rate, please share.

And if you have a different model, please post it! I would rather be taking my estimates from an ensemble of models.