We Still Don’t Know If Masks Work
Introduction:
Recently a paper was published which got attention by estimating that 100% mask wearing in the population would cause a 25% reduction in the effective transmission number (shortened to transmissibility throughout).
This study was observational, so inferring causality from it is difficult. Thanks to the excellent data availability I was able to replicate and attempt to validate the model.
Based on my analysis, this study does not provide good evidence for such a causal link, and there is reason to believe that the observed correlation is spurious.
Model Details:
The paper uses a fairly simple model in combination with MCMC sampling to arrive at its estimates of the effectiveness of various interventions. This simple model allows it to combine multiple disparate data sources into a single estimate that accounts for multiple possible effects.
In order to arrive at its estimate of transmissibility the model considers the following parameters.
- Various NPIs such as school closing or restrictions on gathering at a regional level. 
- Regional Mobility (based on Google Mobility Reports) 
- Self Reported Mask Wearing (based on Facebook surveys) 
- A custom regional factor. 
- Random variation in transmissibility over time. 
It then computes the most likely distribution of weights for each of these factors based on the likelihood of matching the observed cases and deaths.
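As a rough illustration of how factors like these can combine, here is a minimal sketch of a multiplicative model for transmissibility. The functional form, the parameter names and the numbers are all assumptions made for illustration; they are not taken from the paper. The only calibration is that the hypothetical mask coefficient is chosen so that 100% mask wearing gives the headline 25% reduction.

```python
import math

# Hypothetical mask coefficient, chosen so that 100% wearing gives a 25% reduction.
MASK_ALPHA = -math.log(0.75)

def effective_r(regional_r0, npi_active, npi_alphas, mobility_drop,
                mobility_alpha, mask_fraction, mask_alpha=MASK_ALPHA,
                log_noise=0.0):
    """Combine the factors multiplicatively (additively on the log scale)."""
    log_r = math.log(regional_r0)                                  # custom regional factor
    log_r -= sum(a * x for a, x in zip(npi_alphas, npi_active))    # active NPIs
    log_r -= mobility_alpha * mobility_drop                        # reduced mobility
    log_r -= mask_alpha * mask_fraction                            # self-reported mask wearing
    log_r += log_noise                                             # random variation over time
    return math.exp(log_r)

# With no NPIs and no mobility change, full mask wearing scales R by 0.75.
print(effective_r(1.0, [], [], 0.0, 0.0, 1.0))
```

In a model of this shape, the regional factor and the mask term both enter the same product, which is why the two can trade off against each other when the weights are fitted.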
Observational Window:
One common criticism of this paper has been that the window was chosen specifically to make masks look better. If the analysis was extended to include the winter spike, masks would come out looking worse.
This was the critique I initially started out to analyze, but it does not appear to be true. When I extended the analysis window to December, masks still appeared to be effective. In fact the claimed effectiveness increased to 40% over that interval. The data on NPIs in the model is not as high quality for the full interval, but the effect being observed seems robust to a wider time range.
Regional Effects:
If mask wearing causes a drop in transmissibility, then regions with higher levels of mask wearing should observe lower growth rates. This figure from the paper makes the implication clear. In fact the model does not actually make that prediction. Instead (holding NPIs and mobility constant) regions with higher mask wearing are predicted to have higher growth.
This occurs because the custom regional factor learned in our model actually correlates positively with mask wearing.
If we apply the expected 25% reduction in transmissibility for complete mask wearing we see that the higher masked regions still have slightly higher transmissibility.
Relative vs Absolute Mask Wearing:
This analysis of the data leads to a seeming contradiction. Within a given region, increased mask usage is correlated with lower growth rates (the 25% claimed effectiveness), but when comparing across regions masks seem to be ineffective. Depending on the causal story you want to tell, either of these claims could be true.
It is possible that people wear masks most in places where COVID is most transmissible. This would explain why masks don’t appear effective when comparing across regions.
However it is also possible that the same factors cause both mask wearing to increase and transmissibility to decrease. For instance, if people wear masks in response to an observed spike in cases, then the population immunity caused by the spike will make masks appear to be effective even if they are not.
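The apparent contradiction can be reproduced in a toy simulation. The sketch below covers only the first story: it assumes a real 25% mask effect inside every region, and also assumes that more transmissible regions mask more. All numbers are made up for illustration and none come from the paper's data.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(0)
MASK_ALPHA = -math.log(0.75)  # assumed true effect: 25% reduction at 100% wearing

within_corrs, region_mean_mask, region_mean_r = [], [], []
for i in range(10):
    base_r = 1.0 + 0.1 * i      # hypothetical regional base transmissibility
    mean_mask = 0.2 + 0.06 * i  # assumption: more transmissible regions mask more
    masks, rs = [], []
    for t in range(20):
        # Mask wearing drifts by +/- 10 percentage points over the window.
        mask = mean_mask + 0.1 * (t - 9.5) / 9.5
        noise = random.gauss(0.0, 0.005)
        masks.append(mask)
        rs.append(base_r * math.exp(-MASK_ALPHA * mask + noise))
    within_corrs.append(pearson(masks, rs))
    region_mean_mask.append(sum(masks) / len(masks))
    region_mean_r.append(sum(rs) / len(rs))

across_corr = pearson(region_mean_mask, region_mean_r)
print("within-region correlations:", [round(c, 2) for c in within_corrs])
print("across-region correlation:", round(across_corr, 2))
```

Every within-region correlation comes out negative while the across-region correlation is positive, even though masks work by construction here. The simulation cannot say which causal story is true; it only shows that the two observations are compatible.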
Model Variations:
In order to try and determine whether the effect was likely to be true I tried two variations on the experiment.
Uniform Regional Transmissibility:
The first experiment was to force all regions to share the same base transmissibility. This provided an estimate that masks had an effectiveness of −10% (less effective than nothing). This validates the basic concern, but does not address the confounder of high transmissibility regions causing mask wearing (which causes transmissibility to decrease).
No Mask Variation:
The next experiment was to force each region to use a constant value for mask wearing (the average value in the time period). Although this would add noise to the estimate, it should distinguish between the two effects. In this experiment masks appeared to reduce transmission, but the point estimate was only ~8% and was not significantly different from 0.
Data Extrapolation:
Another experiment would be to look at the difference between April and May. During this period mask usage did increase (by around 8 percentage points) and the growth rate did decline. But again there was little to no correlation between how much the mask usage increased, and how much the growth rate declined.
Conclusion:
The failure of large absolute differences in mask wearing across regions to meaningfully impact the observed growth rate should make us skeptical of large claimed effects within a particular region. Clever statistical methods can make observational studies more powerful, but they are not sufficient to prove a causal story. A randomized trial underway in Bangladesh found that it could increase mask wearing by 30 percentage points. This randomization allows us to infer causality with much more confidence, and should provide a more definitive answer.
In contrast to the title of the post, looking at all available evidence, we actually do know that masks work.
Also I’m highly sceptical of claims of the form … “the failure of large absolute differences in variable X across regions to meaningfully impact the observed growth rate … should make us skeptical of large claimed effects”.
A more realistic picture is something like this:
- there are just a few factors with such a large impact on R that it is possible to “clearly see them”. The main ones are vaccination, and more transmissible variants that increase R by factors like 1.6
- there are a few things which have an impact on R of around 40% (the strongest NPIs like banning all gatherings, seasonal amplitude, or large-scale testing)
- there are many moderately strong factors which have an impact of around 20% (masks, weaker NPIs, increased humidity, ...)
Uncovering even the moderate-sized effects requires modelling, and failure to see them clearly, e.g. in regional comparisons, should not make anyone update at all, apart from rejecting obviously implausibly large effect sizes (e.g. masks reducing transmission by 60%).
We have interventions such as Taffix or prophylactic ivermectin that have large effects in studies, but where we don’t have enough good studies to be confident that the effects generalize. The meta review on Ivermectin gives ‘Low-certainty evidence found that ivermectin prophylaxis reduced COVID-19 infection by an average 86% (95% confidence interval 79%–91%)’, and the Taffix study gives ‘Odds ratio for infection among Taffix users was 0.22, a reduction of 78% (95%CI 1%–95%).’
If we did all the things for which studies suggest large effects, or at least ran additional studies to verify that the effects exist, then saying that “we know masks work” would be a lot more responsible than in a world where we don’t do those things.
If you buy that model, there’s still the question of why we force children in schools to wear masks but don’t force the schools to control their humidity levels. Humidity control also has a claimed effectiveness higher than 20%. If, for example, I Google “us school humidity”, I find no hit about COVID-19, only websites that claim “Temporary humidity control is vital to keeping illnesses at bay in schools. Students in poorly ventilated rooms suffer 70% more respiratory illnesses.”
A more dakka approach would be to do all the things but that’s not what we are doing.
On a personal level it makes sense to take your Taffix and ivermectin, wear a properly fitted respirator, do air quality control in your flat and get vaccinated, but the overall debate is still important.
Failure to see an effect doesn’t mean that effects are disproven but it does mean that we don’t know whether the effect exists.
Why we don’t have more studies on Taffix or on increasing humidity in schools, and why this gets so little attention, seems like an Inadequate Equilibria type of problem, which seems quite distinct from evaluating the existing evidence on some topic.
Sorry for too much brevity before.
No. Per Bayes’ theorem, failure to see an effect in an analysis or experiment where you would expect to see no effect, whether or not the effect exists, should make you stay with the prior.
In the specific case of this topic and post, someone looking at masks will likely have a clear prior of “they work, but they aren’t a miracle cure”. More precisely, this could be expressed roughly as an expected effect distribution in R reduction space with almost all of the mass centered somewhere between 5% and 50%. Different reasonable observers looking at different data will likely arrive at somewhat different maximum likelihood estimates and shapes of the distribution, but they will have very little probability mass on no effect or harm, and very little on a large effect.
Should someone with a prior from this class update the prior, based on the evidence consisting of the analysis by Mike Harris?
Not at all! Posterior should stay the same.
Should someone with this prior update the prior, based on reading the referenced paper?
In my view yes; I think Bayesians should update away even more from very low effects (like 5%) or very high effects (like 50%).
The point is that the evidence we have for Taffix and the evidence for humidity is better than the evidence we have for masks.
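The two updates described above can be sketched with a discrete prior. The prior weights and the likelihood values below are invented purely for illustration; the only point is the mechanics: evidence that is equally likely under every hypothesis leaves the prior unchanged, while evidence that favours moderate effects pulls mass away from the extremes.

```python
def bayes_update(prior, likelihood):
    """Posterior over hypotheses, proportional to prior times likelihood."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}

# Hypothetical prior over the fractional reduction in R from masks.
prior = {0.0: 0.05, 0.05: 0.15, 0.25: 0.60, 0.50: 0.15, 0.60: 0.05}

# An analysis whose result is equally likely whatever the true effect is.
flat_likelihood = {h: 0.3 for h in prior}
posterior_flat = bayes_update(prior, flat_likelihood)

# An analysis whose result is most likely under a moderate effect (assumed values).
peaked_likelihood = {0.0: 0.1, 0.05: 0.3, 0.25: 1.0, 0.50: 0.3, 0.60: 0.1}
posterior_peaked = bayes_update(prior, peaked_likelihood)

print(posterior_flat)
print(posterior_peaked)
```

Whether a given analysis really has a flat likelihood is the substantive disagreement in this thread; the code takes no side on that.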
The prior before we run studies is that we don’t know whether or not masks work. Studies are the only way to move from “We think it’s likely that masks work” to “We know that masks work”.
When it comes to mask wearing, there are multiple kinds of mask you can wear. You can wear a cloth mask, you can wear a surgical mask/non-sealing N95/KN95/FFP-2, a sealing N95/KN95/FFP-2, or you can wear a mask with even more protection.
It’s probably critical to distinguish between a non-respirator mask and a respirator (a mask that is sealed to the face and is supposed to filter most of the air that is being inhaled; and exhaled, if there is no exhalation valve).
For anyone whose model for COVID-19 transmission is based on what sources like the CDC’s website said prior to October: even the CDC’s website now says that “breathing in air when close to an infected person who is exhaling small droplets and particles that contain the virus” is a “main way” in which one can get infected with COVID-19 (they list that as the first item on their list of “three main ways” in which COVID-19 spreads). [EDIT: I don’t want people reading this to update that the risk is small if the infected person is not “close”, especially when talking about enclosed spaces.]
If you’re shedding the virus (being infectious, possibly before showing outward symptoms) and you’re wearing a mask, probably the virus is less likely to get to people around you in a supermarket or in public transportation? Therefore, masks work at least a bit? Isn’t that obvious? What am I missing?
Mechanistic explanations are good for priors, but don’t replace, much less refute, empirical evidence. If we see that there is ~0% impact in RCTs, the fact that we know better because it “must” decrease transmission isn’t relevant. And the mechanistic model could be wrong in many ways. For example, maybe people wear masks too poorly to matter (they do!) Maybe masks only help if people never take them off to blow their nose, or scratch it, or similar (they do all those things, too.)
And we see that even according to the paper, the impact is pretty small, so it mostly refutes the claim made by the mechanistic model you propose—the impact just isn’t that big, for various reasons. Which would imply that there is no way to know if it’s materially above 0%.
In fact, I think it’s likely that the impact is non-trivial in reducing transmission, but the OP is right that we don’t have strong evidence.
Right, I agree with all that. I wanted to express that my prior is high for masks working at least a bit. I find it weird to write as though the burden of proof is on finding solid evidence that they do work. I wouldn’t have commented if the OP had titled the post “We still can’t be certain if masks work.”
There are two separate questions: (1) Should you wear a mask? (2) Should you feel safe when you wear a mask and are with other people who wear masks?
When it comes to (1), the bar of evidence you need is much lower than for (2).
Agreement reached!
It might be that as you exhale, more and more viruses get into the mask, and although the mask filters, clean air passing through a mask filled with viruses will pick some of them up, so over the span of a few minutes you are still putting out 80% of the virus particles.
This applies especially to cloth masks, which are not designed to do good filtering.
Masks quite clearly stop droplets, but when it comes to aerosols the case is less clear (I have heard the hypothesis that this is partly the reason why the WHO still hasn’t accepted that COVID-19 transmits via aerosols).
In the threat model of aerosol transmission, a bunch of good air filters might do more than a cloth mask.
This is an excellent element of this post, and I am very pleased to see it! These two possibilities are actually testable in a model that statistically controls (i.e., includes) transmissibility. Notice that in the first possibility, mask wearing is a (partial) mediator for the effect of transmissibility on cases. In the second, mask wearing and transmissibility are commonly-caused by something else(s). Either way, controlling the transmissibility path improves our estimate of the mask wearing path.
[Aside: what if your statistical control was “insufficient” and mask wearing still correlates with the error term in your model? Then your causal estimate loses some causal interpretability, but this critique is virtually impossible to completely address short of randomized experimentation. Still, you try your best and transmissibility here is the biggie that should account not just for itself but also soak up any common causes you might expect to be in the error term.]
This is why beginning-of-period transmissibility factors are included in the model to estimate end-of-period transmissibility results. It’s possible that mask wearing has a negative effect on cases but that it’s more than offset by the big positive direct effect of transmissibility on cases. If you’re interested in the correlation between mask wearing and cases, you’ll be disappointed by the net positive effect, which is of course not interpretable as causal. If you’re interested in the causal effect of mask wearing on cases, you’ll be encouraged by the negative effect and hope to find ways to increase mask wearing besides the epidemic just getting so much worse that people start to wear masks. This is the first possibility. This is also how the second possibility looks statistically—we can still get our estimate of the effect of mask wearing, which ultimately is the focus of the investigation. But whether the first or second possibility is “true” may be relevant not so much for estimating the effect of mask wearing (it’s statistically the same—control for transmissibility), but for getting a wider understanding of the world (does transmissibility cause mask wearing or does something else cause both?, which is not the focus of the study).
I also want to note that the endogeneity critiques of observational (vs. experimental) methods are legit, but there is a lot that can be (and is) done to draw “more causal” conclusions from observational data than mere correlation, and experiments can have their own internal validity concerns, so both approaches are useful for learning about the world.
For the reasons you describe, it seems to me futile to assess the efficacy of masks from observational data. Has anyone looked at the actual mechanism, the filtering out of particles of certain sizes, studying the size range of virus-carrying particles, the different mask materials, and so on? That seems a lot more useful than ever-higher piles of statistics. I remember seeing studies about that about a year ago, but nothing recently.
I am wondering: how much protection would be (or would have been) lost by making masks mandatory 1) for symptomatic people only, rather than 2) for everyone?
My current understanding is that masks work by keeping you from spreading virus. If you don’t have the virus, wearing a mask is useless. So with 1) the only protection lost would be from asymptomatic people. OTOH, the social and economic costs would be (or would have been) much lower.
Also, could 1) have possibly given a slight selective advantage to more benign variants over harmful ones? Some diseases can be harmful while staying mostly asymptomatic for a long time, but I don’t know if coronaviruses could.
Edit: Thanks for the reply. Here is what I meant for 1) in more detail: set a list of symptoms, like temperature, runny nose, etc., and if someone has any symptom, however mild, they have to wear a mask. This should include any symptomatic person, however pauci-.
It’s not true that the only protection lost would be from asymptomatic people, though that would still be a big deal if a quarter of cases are asymptomatic and R is above 4, which it likely is in a population taking no other precautions. And even without masks, people who actively feel very sick often seek treatment and are diagnosed, and when not, mostly aren’t going out in public much. But there are two other groups that matter:
Presymptomatic spread is a big deal for COVID, and accounts for much of why it spreads quickly. That’s why we saw such short serial transmission intervals. And if you don’t eliminate the rapid spread, you’re not getting much benefit from masks.
Paucisymptomatic people, who have a slight runny nose or temperature and nothing else, are fairly common, might not notice, or will assume it’s not COVID, since it’s mild, and spread the virus. (And this category partly overlaps with the previous one—people often start manifesting minor symptoms before they notice all of them.)
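The arithmetic behind the “quarter of cases asymptomatic and R above 4” remark can be made explicit. This is a back-of-the-envelope sketch that assumes asymptomatic and symptomatic cases are equally infectious, which the comment does not state, and it ignores the presymptomatic and paucisymptomatic groups (which only make things worse).

```python
def residual_r(r0, asymptomatic_share, mask_effect_on_symptomatic):
    """R left over when only symptomatic cases are masked.

    Assumes equal infectiousness for asymptomatic and symptomatic cases.
    """
    from_asymptomatic = r0 * asymptomatic_share
    from_symptomatic = r0 * (1 - asymptomatic_share) * (1 - mask_effect_on_symptomatic)
    return from_asymptomatic + from_symptomatic

# Even with hypothetically perfect masks on every symptomatic case,
# the unmasked asymptomatic quarter alone keeps R at 1 when R starts at 4.
print(residual_r(4.0, 0.25, 1.0))

# With a 25% mask effect on symptomatic cases only, R stays well above 1.
print(residual_r(4.0, 0.25, 0.25))
```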
That’s an overstatement, by my understanding. Masks are better at stopping outgoing germs than incoming ones, but they still do some good for both directions.
Why do you believe there wasn’t a significant effect of mask wearing in the military recruits if not because the masks didn’t protect the wearer?
“All recruits wore double-layered cloth masks at all times indoors and outdoors, except when sleeping or eating; practiced social distancing of at least 6 feet; were not allowed to leave campus; did not have access to personal electronics and other items that might contribute to surface transmission; and routinely washed their hands. They slept in double-occupancy rooms with sinks, ate in shared dining facilities, and used shared bathrooms. All recruits cleaned their rooms daily, sanitized bathrooms after each use with bleach wipes, and ate preplated meals in a dining hall that was cleaned with bleach after each platoon had eaten. Most instruction and exercises were conducted outdoors. All movement of recruits was supervised, and unidirectional flow was implemented, with designated building entry and exit points to minimize contact among persons.”
Looks like they did quite a bit more than “wear masks”. Plus, note double-layered masks...
Yes, given the amount that they did you would expect a clear effect if masks do a decent job at protecting wearers.
One possible explanation is that COVID-19 mostly spreads through aerosol transmission, where cloth masks don’t do a good job.
If that scenario is true, air quality measures are more important than mask wearing. I still believe that there are conditions where it makes sense to wear a mask because of the precautionary principle, but as the title of this post suggests, the evidence really isn’t that clear.
The most important bit here is not “double-layered”; it’s “all recruits”. There was no unmasked group for comparison, so this study tells us nothing about mask effectiveness beyond “some people still got infected, so they’re less than 100% effective”.
What military recruits are you talking about? I didn’t see any reference to the military.
This sounds like you basically don’t believe in Evidence-Based Medicine and form your beliefs based on simple pathophysiological models (the thing that Evidence-Based Medicine set out to fight), or don’t care enough about mask wearing to form an informed opinion about it.
To recap, there are actual studies on mask wearing. One of them is SARS-CoV-2 Transmission among Marine Recruits during Quarantine. The benefit of studying military recruits is that they are likely the best group for compliance with policies, as the training instructors made sure that they were wearing their masks.
I know this might sound harsh, but “I think you are wrong because of one-sentence pathophysiological models” is not the kind of argument I like to see on LessWrong from people who haven’t done any research to become familiar with the topic.
It’s not that I believe that you should always reason in an Evidence-Based manner, but be at least a bit more sophisticated about it.
It’s not that I “don’t believe in Evidence-Based Medicine”, it’s that you didn’t mention in your first comment that you were talking about a different study, so I really didn’t know what you were talking about. Thanks for giving the link.
The Marine study doesn’t address the effects of masks. Both the participants and nonparticipants wore masks. The actual difference between those groups was that the participants were asked about symptoms, tested, and isolated if positive at day 0, 7, and 14, versus only on day 14 for nonparticipants. It gives us some (unsurprising) evidence that surveillance testing and isolation helps: on day 14, at least 11/1760 (0.6%) and possibly as many as 22/1847 (1.2%) participants were positive, compared to 26/1554 (1.7%) nonparticipants. Unfortunately the reporting is not great, so we don’t know exactly how many participants were positive on day 14. And this is pretty weak evidence: we don’t know how many of the nonparticipants would have tested positive at day 0, so it’s hard to say how much of the day-14 difference was due to weeding out infected participants versus the participants possibly starting with a lower infection rate.
Correction: for participants on day 14, it was somewhere between 11 and 33 out of 1847 (0.6%-1.8%). Not that it makes much of a difference.
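For anyone who wants to check the percentages quoted in this comment and its correction, they follow directly from the counts given:

```python
def pct(positive, total):
    """Share of positives as a percentage, rounded to one decimal place."""
    return round(100 * positive / total, 1)

# Participants on day 14: lower and upper bounds from the correction.
print(pct(11, 1847), pct(33, 1847))

# Nonparticipants on day 14.
print(pct(26, 1554))
```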