Quadratic models and (un)falsified data

On 5th Feb a commenter on Reddit posted that the coronavirus cases in China were following a suspiciously accurate quadratic curve, implying that China was making up their data and weren’t even bothering to hide it particularly well.

This set off my bullshit sensors fairly strongly so I wanted to check it out.

Having looked at the data myself, there may be other reasons why the Chinese data are not accurate but I think the quadratic pattern provides no evidence in favour of falsification.

I doubt people are making many decisions based on that post, but looking at where the statistics went wrong may still be useful.

R^2

It turns out that the claim of a “near perfect” model was based on a very high R^2 value (0.9995).

R^2 is often chosen to summarise how precisely a regression fits the data: it tells you how much of the variance in the data is explained by the equation. Surely if only 1 part in 2,000 isn’t explained by the model that indicates that the data is fabricated?

The first thing to note is that “variance” is a technical term which isn’t the same thing as the natural understanding of “variation”. If you don’t appreciate this then R^2 will seem more impressive than it is. In particular, variance isn’t the deviation from the mean, but the squared deviation. In the more natural understanding (taking the square root of the unexplained fraction), the quadratic model explains 44 parts in 45 of the deviation from the mean, not 1,999 in 2,000. This is still pretty good but seems like less evidence for fabrication.
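The arithmetic behind those two framings can be checked in a few lines. This is only a sketch of the calculation: the sole input is the 0.9995 quoted in the claim, and the variable names are my own.

```python
import math

r_squared = 0.9995  # the value quoted in the original claim

# Fraction of the variance (squared deviation) left unexplained.
unexplained_variance = 1 - r_squared

# The square root moves from squared deviation back to plain deviation.
unexplained_deviation = math.sqrt(unexplained_variance)

print(round(1 / unexplained_variance))   # 2000: "1 part in 2,000"
print(round(1 / unexplained_deviation))  # 45: "1 part in 45"
```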

An alternative explanation of the R^2 value is that it compares two models:

1. The model that you are trying to fit to the data

2. The model where the value of y is expected to be the same for all values of x (the mean of the y values)

A high R^2 value is telling us that model 1 is a much better fit than model 2.
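Written out, that comparison is the standard definition R^2 = 1 - SS_res / SS_tot, where SS_res is the squared misfit of model 1 and SS_tot is the squared misfit of model 2. A minimal sketch, with a function name and toy numbers that are purely illustrative:

```python
def r_squared(y, y_pred):
    """R^2 = 1 - SS_res / SS_tot for observations y and predictions y_pred."""
    mean_y = sum(y) / len(y)
    # Model 1: squared misfit of the model being tested.
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    # Model 2: squared misfit of "y is always the mean of the y values".
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]                    # toy data, not case counts
print(r_squared(y, y))                      # 1.0: model 1 is perfect
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))   # 0.0: model 1 is no better than model 2
```

The worse model 2 is, the larger SS_tot becomes, and the easier it is for almost any sensible model 1 to score close to 1.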

However in our case we already know model 2 is going to be a terrible, terrible, terrible fit to our data. The y value used in the regression is total cases so far. So model 2 represents some number of cases having already been identified at the beginning of the time period in question and no more cases occurring during the time period.

So saying that model 1 accounts for 44 parts in 45 of the deviation between the data and model 2 doesn’t really tell me much – model 2 is a lost cause.

New cases per day

The problem here is the chosen y-axis. Instead of choosing total number of cases by a certain day, it would be better to choose new cases per day. This removes excess correlation between data points.

If we do this then instead of fitting a quadratic curve we need to be fitting a straight line (we’re taking the derivative with respect to x) but I’ll still call it quadratic for the sake of consistency. Model 2 changes from representing a constant total number of cases to a constant number of new cases every day. This still isn’t a particularly likely model but is certainly an improvement.
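To see how much the choice of y-axis matters, here is a small simulation. Everything in it is an assumption for illustration: the daily counts are synthetic (a linear trend plus random noise, not real case data), and the helper names are mine. It fits a straight line to new cases per day, then scores that same model against the running totals, where the summed-up line becomes a quadratic.

```python
import random
from itertools import accumulate


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(y, y_pred):
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


rng = random.Random(0)  # fixed seed so the run is repeatable
days = list(range(1, 16))
# Synthetic new cases per day: linear trend plus noise (made-up numbers).
daily = [100 + 200 * d + rng.gauss(0, 300) for d in days]
total = list(accumulate(daily))  # total cases so far

slope, intercept = fit_line(days, daily)
daily_pred = [intercept + slope * d for d in days]
total_pred = list(accumulate(daily_pred))  # the same model, summed up

r2_daily = r_squared(daily, daily_pred)
r2_total = r_squared(total, total_pred)
print(f"R^2 on new cases per day: {r2_daily:.4f}")
print(f"R^2 on total cases:       {r2_total:.4f}")
```

The model is identical in both rows; only the y-axis differs. On running totals the noise is swamped by the steadily growing total, so the fit looks far tighter than it does on the daily counts.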

Plotting and regressing we get:

Our R^2 value has gone down to 0.96. This is still high and suggests that quadratic growth is a fairly good model for the data but isn’t suspicious. For instance, I can also fit a power law (again 2 free parameters) to the data and get R^2 = 0.966.

So within the training set our quadratic model (linear new cases per day) explains about as much of the variance as a power law model of new cases does.
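For anyone wanting to repeat this kind of comparison, a rough sketch follows. The fitting method here is an assumption on my part: the power law y = a * x^b is fitted by ordinary least squares on the logs, which is one common approach (a nonlinear least-squares fit would give slightly different numbers). The daily counts are synthetic, so the printed values will not match the 0.96 and 0.966 above.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_power_law(x, y):
    """Fit y = a * x**b via a straight line in log-log space (needs x, y > 0)."""
    b, log_a = fit_line([math.log(v) for v in x], [math.log(v) for v in y])
    return math.exp(log_a), b


def r_squared(y, y_pred):
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


rng = random.Random(1)
days = list(range(1, 16))
# Synthetic new cases per day (made-up numbers, kept positive for the logs).
daily = [100 + 200 * d + rng.uniform(-150, 150) for d in days]

slope, intercept = fit_line(days, daily)   # two free parameters
a, b = fit_power_law(days, daily)          # also two free parameters

r2_linear = r_squared(daily, [intercept + slope * d for d in days])
r2_power = r_squared(daily, [a * d ** b for d in days])
print(f"linear:    R^2 = {r2_linear:.3f}")
print(f"power law: R^2 = {r2_power:.3f}")
```

Both scores are computed on the original scale rather than the log scale, so they can be compared directly.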

Looking outside the dataset

The obvious thing to do is check whether the pattern was there outside the dataset.

It wasn’t.

It is clear that the pattern breaks down shortly after it was noticed (day 15 on this chart).

In addition we can look at the pattern before the training set. Again as soon as we go outside of the dataset used to create the graph the pattern completely breaks down. This is not surprising as at this point the quadratic model predicts a negative number of new cases per day.
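The negative prediction is a built-in feature of any rising straight-line model of new cases per day: go back far enough and the line crosses zero. A tiny sketch, using hypothetical coefficients chosen for illustration (they are not the fitted values from the regression above):

```python
# Hypothetical straight-line model: new_cases = intercept + slope * day.
slope, intercept = 250.0, -500.0

# The line crosses zero here; every earlier day gets a negative prediction.
zero_day = -intercept / slope

predictions = {day: intercept + slope * day for day in range(-1, 5)}
negative_days = [day for day, cases in predictions.items() if cases < 0]

print(zero_day)       # 2.0
print(negative_days)  # [-1, 0, 1]
```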

(Looking at deaths instead of cases shows a similar story. For deaths the pattern keeps going for a little longer but even then the power law fit matches the data better.)

So to make the case for China falsifying the data quadratically you have to also say that the start date for them doing it was ~20th Jan and the end date more-or-less straight after the pattern was noticed. Presumably this would be justified by China having been caught out and changing from then on.

Or possibly, this is just how the virus develops. Now that we have some developments in other countries it is possible to compare spread rates. Where the disease has got out of containment there is a remarkably consistent pattern of growth which matches the China rate very closely. (I’m planning to write a more detailed post on this.)