Machine-Readable Prevalence Estimates
In Estimating Norovirus Prevalence I wrote up an estimate of how many people had Norovirus at various times in the past, describing some of my work on the P2RA project at the NAO. That post is prose, though, with inline calculations, and this has a few drawbacks:
- The calculations are manual, which makes it harder to catch errors.
- It's hard to tell exactly how sourcing works for inputs that are being pulled in from elsewhere.
- You might want multiple estimates based on changing inputs over time.
At the NAO we also started with prose estimates, in Google Docs, but in addition to the issues above we found that the review tooling wasn't a good fit for the kind of deep reviews we wanted. After some initial struggles we switched to Python to represent our estimates; you can see the code on GitHub.
An estimate is a combination of inputs (numbers we get from somewhere else) and calculations (how we combine those inputs). Most of the effort is in the inputs: making sure it’s clear where the numbers come from. For example, here’s how we represent that the CDC estimates that there were 1.2M people with HIV in the US in 2019:
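The real definitions live in the repo linked above; as a rough sketch (the class and field names here are illustrative stand-ins, not the actual API), the key idea is that every input carries its own location, date, and source alongside the number itself:

```python
from dataclasses import dataclass

# Illustrative stand-in for an absolute-count input: the number plus
# the metadata a reviewer needs to check it against the original source.
@dataclass(frozen=True)
class PrevalenceAbsolute:
    infections: float  # people currently infected
    country: str
    date: int          # year the estimate applies to
    source: str        # where the number was taken from

cdc_2019_hiv_prevalence = PrevalenceAbsolute(
    infections=1.2e6,
    country="United States",
    date=2019,
    source="CDC HIV surveillance estimate for 2019",
)
```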
An absolute prevalence isn’t useful to us without connecting it to a population. Here’s how we could represent the corresponding population:
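Continuing the sketch (again with illustrative names), the population input carries the same kind of metadata, so that mismatched inputs can be caught mechanically:

```python
from dataclasses import dataclass

# Same idea as the prevalence input: a number plus the metadata needed
# to check that it lines up with the other inputs it's combined with.
@dataclass(frozen=True)
class Population:
    people: float
    country: str
    date: int
    source: str

us_population_2019 = Population(
    people=328.2e6,  # approximate US population in 2019
    country="United States",
    date=2019,
    source="US Census Bureau estimate",
)
```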
And we can connect these to get a Prevalence. The to_rate method checks that the locations and dates are compatible, does the division, and gives us a Prevalence.
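A minimal sketch of that check-then-divide step, repeating simplified versions of the input classes so the snippet stands on its own (the real to_rate in the repo is more thorough about compatibility):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prevalence:
    infections_per_100k: float
    country: str
    date: int

@dataclass(frozen=True)
class Population:
    people: float
    country: str
    date: int

@dataclass(frozen=True)
class PrevalenceAbsolute:
    infections: float
    country: str
    date: int

    def to_rate(self, population: Population) -> Prevalence:
        # Refuse to divide numbers that describe different places or times.
        assert self.country == population.country
        assert self.date == population.date
        return Prevalence(
            infections_per_100k=self.infections / population.people * 100_000,
            country=self.country,
            date=self.date,
        )

hiv = PrevalenceAbsolute(infections=1.2e6, country="United States", date=2019)
us = Population(people=328.2e6, country="United States", date=2019)
print(hiv.to_rate(us).infections_per_100k)  # ≈ 366 per 100,000
```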
For a more complex example, you could look at norovirus.py. This is doing the calculation from the previous blog post, with the addition of estimates for the Norovirus Genogroup I and II subtypes.
For each estimate one team member creates the initial estimate and then we use GitHub’s code review process for a line-by-line review. This includes validating all inputs match what’s listed on the external site and that we’re using the data source correctly, in addition to checking that the overall structure of the estimate is reasonable.
I think there’s a decent chance that this isn’t the final way this will look, and as we get further along in the project we’ll want to make the details of the prevalence calculation available to the model so it can understand uncertainty. Most of the work is in determining the inputs and reviewing the structure of the estimate, however, so if we do need to make changes like that I expect they won’t be too bad.
This post describes in-progress work at the NAO and covers work from a team. The Python-based estimation framework is mostly my work, with help from Simon Grimm and Dan Rice.