I don’t think the evidence provided substantiates the conclusions in this post. Overall it looks like you’ve demonstrated this:
They didn’t commit obvious fraud (last digit test; see the sketch after this list)
If you remove all 14 of their control variables, this removes much of their statistical power, so the results are only marginally significant (p = 0.026981 overall, compare p < 0.01 in many independent categories in the original)
If you remove all 14 of their control variables AND change from Poisson to negative binomial, the results are no longer significant. (However, if you include the control variables and use negative binomial the results are still significant: Web Table 12 in appendix)
Furthermore you only examine all cause mortality, whereas the study examines deaths from specific diseases. Your replication will be less powered because even in non-high-income countries most people die from things like heart disease and stroke, reducing your power to detect even large changes in malaria or HIV deaths caused by USAID. Removing all 14 controls will kill your statistical power further, because the presence of things like “adequate sanitation” and “health expenditure” will definitely affect the mortality rate.
Finally, in your conclusion, it makes no sense to say “some of the choices they make to strengthen the result end up being counterproductive”—this is weak evidence against p-hacking that to my mind cancels the weak evidence you found for p-hacking! This is super unvirtuous: don’t just assume they were p-hacking and say your belief is confirmed when you found some weak evidence for and some weak evidence against.
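(For anyone unfamiliar with the technique, here’s a minimal sketch of the generic last-digit idea, using placeholder counts rather than the post’s actual code: the final digits of genuine multi-digit counts should be roughly uniform, so a chi-square goodness-of-fit test against uniformity flags fabricated-looking numbers.)

```python
# Last-digit test sketch; the counts are placeholders, not the study's data.
import numpy as np
from scipy.stats import chisquare

deaths = np.random.default_rng(0).integers(1_000, 1_000_000, size=500)
observed = np.bincount(deaths % 10, minlength=10)   # frequency of each final digit
expected = np.full(10, deaths.size / 10)            # uniform expectation
stat, p = chisquare(observed, expected)
print(p)  # a small p-value would suggest the last digits are not uniform
```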
Off the top of my head, here are the results I’d want to see from someone skeptical of this study, in increasing order of effort. I agree these would be much easier if they shared their full data.
Think about whether the causal story makes sense (is variance in USAID funding exogenous and random-ish?)
Add year-based fixed effects
Some kind of exploratory analysis to decide whether the negative binomial model is appropriate. I have never fit a negative binomial model, but here’s what o3 suggests
Use a Poisson model with robust standard errors (a rough sketch of this and the previous item follows the list)
Multiverse analysis where you pick 20 plausible control variables, control for 10 of them, and see how many are significant (a second sketch below illustrates this)
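For the fixed effects, the overdispersion check, and the robust-standard-error Poisson, here’s a minimal sketch of one standard approach. This is my sketch, not the paper’s exact specification; the DataFrame and its columns (country, year, population, deaths, usaid_per_capita) are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Placeholder panel purely so the sketch runs; substitute the real data.
rng = np.random.default_rng(0)
countries, years = [f"c{i}" for i in range(30)], range(2001, 2021)
df = pd.DataFrame([(c, y) for c in countries for y in years], columns=["country", "year"])
df["population"] = rng.integers(1_000_000, 50_000_000, size=len(df))
df["usaid_per_capita"] = rng.gamma(2.0, 2.0, size=len(df))
df["deaths"] = rng.poisson(df["population"] * 0.008)

# Poisson with country and year fixed effects, population as exposure,
# and country-clustered (robust) standard errors.
poisson = smf.glm(
    "deaths ~ usaid_per_capita + C(country) + C(year)",
    data=df,
    family=sm.families.Poisson(),
    exposure=df["population"],
).fit(cov_type="cluster",
      cov_kwds={"groups": df["country"].astype("category").cat.codes})
print(poisson.summary())

# Crude overdispersion check: Pearson chi-square per residual degree of
# freedom should be near 1 for a well-specified Poisson model; values much
# larger than 1 suggest trying a negative binomial instead.
print("dispersion ratio:", poisson.pearson_chi2 / poisson.df_resid)

# Negative binomial with a maximum-likelihood dispersion parameter, for an
# AIC comparison against the Poisson fit.
negbin = smf.negativebinomial(
    "deaths ~ usaid_per_capita + C(country) + C(year)",
    data=df,
    exposure=df["population"],
).fit(disp=False, maxiter=200)
print("Poisson AIC:", poisson.aic, "  NegBin AIC:", negbin.aic)
```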
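And a rough sketch of the multiverse idea, reusing the hypothetical df from the sketch above and inventing placeholder control columns:

```python
# Multiverse sketch: refit the same Poisson specification over many random
# subsets of 10 out of 20 candidate controls and count how often the USAID
# term stays significant at the 5% level. Control columns are invented.
import random

candidate_controls = [f"control_{i}" for i in range(20)]
for name in candidate_controls:
    df[name] = rng.normal(size=len(df))

picker = random.Random(0)
n_draws, hits = 100, 0
for _ in range(n_draws):
    subset = picker.sample(candidate_controls, 10)
    formula = ("deaths ~ usaid_per_capita + C(country) + C(year) + "
               + " + ".join(subset))
    res = smf.glm(formula, data=df, family=sm.families.Poisson(),
                  exposure=df["population"]).fit(
        cov_type="cluster",
        cov_kwds={"groups": df["country"].astype("category").cat.codes})
    hits += res.pvalues["usaid_per_capita"] < 0.05
print(f"{hits}/{n_draws} specifications significant at the 5% level")
```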
Now some comments in no particular order:
First, thanks for including the equations; it makes the post a lot easier to understand.
when the world’s foremost medical journal publishes work indicating that a policy’s results will be comparable in death count to history’s engineered mass-starvations by year’s end, I’m going to be skeptical.
I’m as skeptical of this as any other study, but we should beware the absurdity heuristic and instead think about the actual claimed mechanism. There are LOTS more people in LMICs in the world today than in 1932 Ukraine and we know that millions die from preventable diseases per year; we also know that one of the main goals of USAID is to reduce deaths from preventable diseases, and that spending enough resources on preventable diseases can basically eliminate them.
There are at least two reasons I don’t think these cases are similar.
The link you shared is about inherently harder modeling than this Lancet study. Ferguson’s model simulates the exponential spread of a transmissible disease, which inherently produces confidence intervals on the final death count spanning orders of magnitude, especially for median versus worst-case scenarios. The only things these studies share are that they both use the extremely common technique of Monte Carlo modeling and that they’re both about disease.
The prediction that 500k people would die without a lockdown in the UK, and 2M in the US, isn’t too far off: even with partial lockdowns and vaccination, 232k died in the UK and 1.19M in the US, and the modelers had limited information at the time.
It’s way too much work for me to go find and recreate all of the 18(!) control variables they use.
They have 14 control variables, not 18: 4 time-based controls and 10 other controls.
The per year binning also torpedoes their results.
This is interesting, I’d like to see it with the full methodology
It looks like a slightly better fit from a BIC perspective. The second quartile drops from significance and the third quartile is barely significant. Overall, this is still a much weaker result than what they report, but the binning itself doesn’t change a lot about the regression.
Note their ** means p < 0.05 and their *** means p < 0.01, which is different from your annotations
The weird thing about this paper is its relation to Regression 1. In my reproduction that doesn’t have a gazillion controls, there is a significant relationship between per capita USAID spend and overall mortality. For whatever reason this wasn’t enough and some of the choices they make to strengthen the result end up being counterproductive. In both the categorical binning issue and the choice to use Poisson, the alternative choices make the results null, which I consider to be fairly strong evidence of p-hacking.
My understanding is that when you replaced their experiment with a simpler, much less powerful experiment, then did a cursory sensitivity analysis, the results were sometimes significant and sometimes not significant. This is totally expected. Also it makes no sense to say “some of the choices they make to strengthen the result end up being counterproductive”—this is weak evidence against p-hacking that to my mind cancels the weak evidence you found for p-hacking! This is super unvirtuous: don’t just assume they were p-hacking and say your belief is confirmed when you found some weak evidence for and some weak evidence against.
Also, the results are significant in so many categories that I disagree with p-hacking being likely here. If you aggregated the results in some principled way, you would get some crazy low value like 10^-50, so it can’t be a multiple comparisons issue. As for the controls, they seem plausible on priors (I’m not familiar enough to know if they’re standard), and seem to improve the predictive power of the model.
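For what it’s worth, one principled way to aggregate independent p-values is Fisher’s method; here’s a minimal sketch with placeholder p-values (and with the caveat that it assumes the per-category tests are independent, which won’t strictly hold when they all come from the same countries):

```python
# Fisher's method: under the global null, -2 * sum(log(p_i)) follows a
# chi-square distribution with 2k degrees of freedom. These p-values are
# placeholders, not the paper's.
from scipy.stats import combine_pvalues

pvals = [0.003, 0.01, 0.0004, 0.02, 0.008]
stat, combined_p = combine_pvalues(pvals, method="fisher")
print(combined_p)
```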
Thanks for the thoughtful comment. I’ll try to address these remarks in order. You state
Furthermore you only examine all cause mortality, whereas the study examines deaths from specific diseases.
They also use overall mortality (Web Table 10), which is what I was trying to reproduce and screenshotted. The significance figures aren’t really different from those for the regressions broken down by mortality cause (Web Table 15), but the all-cause rate ratios clearly show smaller effects than the disaggregated ones because people die from other stuff. I mostly ignored the rate-ratio sizes and focused on the significances here, but I agree the most powerful effects probably come from the disaggregated regressions.
Removing all 14 controls will kill your statistical power further, because the presence of things like “adequate sanitation” and “health expenditure” will definitely affect the mortality rate. … they seem plausible on priors (I’m not familiar enough to know if they’re standard), and seem to improve the predictive power of the model.
This is a fundamental misunderstanding of how controls work and what they’re supposed to do. Just yesterday Cremieux wrote a pretty good piece on this very topic. The authors include these controls with little thought to the underlying causal mechanism. Their only remarks on them at all in the supplemental material are
These controls, combined with country fixed-effects models and other isolation factors, help refine the relationship between USAID per capita and changes in mortality over time.
This isn’t necessarily true at all. Consider one of the controls: Health expenditure as a percent of GDP. Is that a confounder, influencing USAID spend levels and mortality rates simultaneously? Is it a collider, caused by increased or decreased USAID spend and mortality changes? In the former case, yes, go ahead and control for it, but if it’s the latter, it screws up the analysis. Both are plausible. The authors consider neither.
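A toy simulation makes the collider case concrete (the variable names and coefficients here are made up for illustration, not estimates from the paper):

```python
# Collider toy example: usaid and mortality are generated independently,
# but both feed into health_exp. Controlling for the collider induces a
# spurious negative usaid-mortality association.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000
usaid = rng.normal(size=n)
mortality = rng.normal(size=n)                                   # unrelated to usaid by construction
health_exp = 0.7 * usaid + 0.7 * mortality + rng.normal(size=n)  # collider
sim = pd.DataFrame({"usaid": usaid, "mortality": mortality, "health_exp": health_exp})

print(smf.ols("mortality ~ usaid", data=sim).fit().params["usaid"])               # near 0, correct
print(smf.ols("mortality ~ usaid + health_exp", data=sim).fit().params["usaid"])  # clearly negative, spurious
```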
I did apparently miscount the controls. It’s unclear why their own spec on page 13 miscounts them as well.
You mention the Monte Carlo simulations aren’t comparable. This is a fair point, and I really like the explanation. I didn’t really touch on that aspect of this analysis in the post, but you’ve persuaded me I was making a cheap point.
Also it makes no sense to say “some of the choices they make to strengthen the result end up being counterproductive”.
This was unclear, and I regret it. I meant counterproductive in the sense of complicating the analysis. I’m still not clear on how they got such strong, consistent results. My suspicion is careful control selection, as alluded to above.