If the effect is so small that a sample of several thousand is not sufficient to reliably observe it, then it doesn’t even matter that it is positive. An analogy: Suppose I tell you that eating garlic daily increases your IQ, and point to a study with three million participants and P < 1e-7. Vastly significant, no? Now it turns out that the actual size of the effect is 0.01 points of IQ. Are you going to start eating garlic? What if it weren’t garlic, but a several-billion-dollar government health program? Statistical significance is indeed not everything, but there’s such a thing as considering the size of an effect, especially if there’s a cost involved.
Moreover, please consider that “consistent with zero” means exactly that. If you throw a die ten times and it comes up heads six, do you “hesitantly update a very tiny bit” in the direction of the coin being biased? Would you do so, if you did not have a prior reason to hope that the coin was biased?
I respectfully suggest that you are letting your already-written bottom line interfere with your math.
If the effect is so small that a sample of several thousand is not sufficient to reliably observe it, then it doesn’t even matter that it is positive.
I strongly disagree.
An old comment of mine gives us a counterexample. A couple of years ago, a meta-analysis of RCTs found that taking aspirin daily reduces the risk of dying from cancer by ~20% in middle-aged and older adults. This is very much a practically significant effect, and it’s probably an underestimate for reasons I’ll omit for brevity — look at the paper if you’re curious.
If you do look at the paper, notice figure 1, which summarizes the results of the 8 individual RCTs the meta-analysis used. Even though all of the RCTs had sample sizes in the thousands, 7 of them failed to show a statistically significant effect, including the 4 largest (sample sizes 5139, 5085, 3711 & 3310). The effect is therefore “so small that a sample of several thousand is not sufficient to reliably observe it”, but we would be absolutely wrong to infer that “it doesn’t even matter that it is positive”!
The heuristic that a hard-to-detect effect is probably too small to care about is a fair rule of thumb, but it’s only a heuristic. EHeller & Unnamed are quite right to point out that statistical significance and practical significance correlate only imperfectly.
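To make the power point concrete, here is a rough normal-approximation calculation. The 3% baseline cancer-mortality rate over follow-up and the ~2,500-per-arm split are illustrative assumptions of mine, not figures from the meta-analysis:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_prop_power(p1, p2, n_per_arm, z_alpha=1.96):
    """Approximate power of a two-sided two-proportion z-test at
    alpha = 0.05, equal arms (normal approximation)."""
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    return norm_cdf(abs(p1 - p2) / se - z_alpha)

# Assumed: 3.0% cancer mortality in controls over follow-up,
# a 20% relative reduction (to 2.4%) on aspirin, ~2,500 per arm.
power = two_prop_power(0.030, 0.024, 2500)
print(f"power ≈ {power:.0%}")
```

Under these made-up but plausible inputs, each individual trial has only around a one-in-four chance of reaching p < 0.05, so seven non-significant results out of eight is roughly what a real 20% effect would be expected to produce.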
Does vitamin D reduce all-cause mortality in the elderly? The point-estimates from pretty much all of the various studies are around a 5% reduction in risk of dying for any reason—pretty nontrivial, one would say, no? Yet the results are almost all not ‘statistically significant’! So do we follow Rolf and say ‘fans of vitamin D ought to update on vitamin D not helping overall’… or do we, applying power considerations about the likelihood of making the hard cutoffs at p<0.05 given the small sample sizes & plausible effect sizes, note that the point-estimates are in favor of the hypothesis? (And how does this interact with two-sided tests—vitamin D could’ve increased mortality, after all. Positive point-estimates are consistent with vitamin D helping, and less consistent with no effect, and even less consistent with it harming; so why are we supposed to update in favor of no help or harm when we see a positive point-estimate?)
If we accept Rolf’s argument, then we’d be in the odd position of, as we read through one non-statistically-significant study after another, decreasing the probability of ‘non-zero reduction in mortality’… right up until we get the Autier or Cochrane data summarizing the exact same studies & plug it into a Bayesian meta-analysis like Salvatier did & abruptly flip to ’92% chance of non-zero reduction in mortality’.
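For scale, here is the per-arm sample size a single trial would need to reliably detect a 5% relative mortality reduction, assuming (purely for illustration) 10% control-group mortality over the trial:

```python
def n_per_arm(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Per-arm sample size for a two-sided two-proportion z-test at
    alpha = 0.05 with 80% power (normal approximation, equal arms)."""
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2

# Illustrative assumptions: 10% control-group mortality over the trial,
# a 5% relative reduction (to 9.5%) with vitamin D.
n = n_per_arm(0.100, 0.095)
print(f"needed per arm ≈ {n:,.0f}")
```

Over 55,000 per arm under these assumptions, which is far larger than the typical vitamin D trial; individually non-significant results are exactly what we should expect even if the effect is real.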
A couple of years ago, a meta-analysis of RCTs found that taking aspirin daily reduces the risk of dying from cancer by ~20% in middle-aged and older adults.
That’s a curious metric to choose. By that standard taking aspirin is about as healthy as playing a round of Russian Roulette.
It’s a fairly natural metric to choose if one wishes to gauge aspirin’s effect on cancer risk, as the study’s authors did.
By that standard taking aspirin is about as healthy as playing a round of Russian Roulette.
Fortunately, the study’s authors and I also interpreted the data by another standard. Daily aspirin reduced all-cause mortality, and didn’t increase non-cancer deaths (except for “a transient increase in risk of vascular death in the aspirin groups during the first year after completion of the trials”). These are not results we would see if aspirin effected its anti-cancer magic by a similar mechanism to Russian Roulette.
It’s a fairly natural metric to choose if one wishes to gauge aspirin’s effect on cancer risk, as the study’s authors did.
Pardon me. Mentioning only curiosity was politeness; the more pointed descriptions I would add are ‘naive’ or ‘suspicious’. By itself that metric really is worthless, and reading this kind of health claim should set off warning bells. Lost purposes are a big problem when it comes to medicine, partly because it is hard, mostly because there is more money in the area than nearly anywhere else.
Fortunately, the study’s authors and I also interpreted the data by another standard. Daily aspirin reduced all-cause mortality, and didn’t increase non-cancer deaths (except for “a transient increase in risk of vascular death in the aspirin groups during the first year after completion of the trials”).
And this is the reason low dose aspirin is part of my daily supplement regime (while statins are not).
And this is the reason low dose aspirin is part of my daily supplement regime (while statins are not).
I recently stopped with the low dose aspirin, the bleeding when I accidentally cut myself has proven to be too much of an inconvenience. For the time being, at least.
I’d assume they mean something like the per-year risk of dying from cancer conditional on previous survival—if they indeed mean the total lifetime risk of dying from cancer I agree it’s ridiculous.
Yeah, pretty much. There are other examples of this where something harmful appears to be helpful when you don’t take into account possible selection biases (like being put into the ‘non-cancer death’ category); for example, this is an issue in smoking—you can find various correlations where smokers are healthier than non-smokers, but this is just because the unhealthier smokers got pushed over the edge by smoking and died earlier.
If the effect is so small that a sample of several thousand is not sufficient to reliably observe it, then it doesn’t even matter that it is positive.
Have you read the study in question? The treatment sample is NOT several thousand; it’s about 1,500. Further, the incidence of the diseases being looked at is only a few percent or less, so the treatment sample sizes for the most prevalent diseases are around 50 (also, if you look at the specifics of the sample, the diseased groups are pretty well controlled).
I suggest the following exercise: ask yourself what WOULD be a big effect, and then work through whether the study has the power to see it.
Moreover, please consider that “consistent with zero” means exactly that.
Yes, but in this case the sample sizes are small and the error bars are so large that “consistent with zero” is ALSO consistent with a 25+% reduction in incidence (which is a large intervention). The study is incapable of distinguishing a hugely important effect from zero effect, so we shouldn’t update much at all, which is why I wish McArdle had talked about statistical power. Before we ask “how should we update”, we should ask “what information is actually here?”
Edit: If we treat this as an exploration, it says “we need another study”: after all, the effects could be as large as 40%! That’s a potentially tremendous intervention. Unfortunately, it’s unethical to randomly boot people off of insurance, so we’ll likely never see that study done.
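A back-of-envelope version of that error-bar point, assuming ~1,500 per arm and a condition affecting 5% of controls (illustrative figures, not the study's):

```python
from math import sqrt

# How wide is the 95% CI around a zero difference in proportions,
# with ~1,500 per arm and 5% incidence in controls?
p, n = 0.05, 1500
se_diff = sqrt(2 * p * (1 - p) / n)   # SE of the difference in proportions
half_width = 1.96 * se_diff           # 95% CI half-width, absolute scale
relative = half_width / p             # as a fraction of baseline risk
print(f"±{half_width:.1%} absolute, ±{relative:.0%} relative")
```

So even a perfectly null point estimate leaves a roughly 30% relative reduction (or increase) inside the confidence interval, which is exactly the "consistent with a huge effect and with zero" situation described above.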
If the effect is so small that a sample of several thousand is not sufficient to reliably observe it, then it doesn’t even matter that it is positive. [...] Statistical significance is indeed not everything, but there’s such a thing as considering the size of an effect, especially if there’s a cost involved.
Health is extremely important—the statistical value of a human life is something like $8 million—so smallish looking effects can be practically relevant. An intervention that saves 1 life out of every 10,000 people treated has an average benefit of $800 per person. In this Oregon study, people who received Medicaid cost an extra $1,172 per year in total health spending, so the intervention would need to save 1.5 lives per 10,000 person-years (or provide an equivalent benefit in other health improvements) for the health benefits to balance out the health costs. The study looked at fewer than 10,000 people over 2 years, so the cost-benefit cutoff for whether it’s worth it is less than 3 lives saved (or equivalent).
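The break-even arithmetic above, spelled out:

```python
# Figures from the comment: $8M statistical value of a life,
# $1,172/year in extra health spending per Medicaid enrollee.
vsl = 8_000_000
extra_cost_per_year = 1_172
value_per_life_per_10k = vsl / 10_000   # $800 per person if 1 life per 10,000 is saved
lives_needed = extra_cost_per_year / value_per_life_per_10k
print(f"break-even: {lives_needed:.2f} lives saved per 10,000 person-years")
```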
So “not statistically significant” does not imply unimportant, even with a sample size of several thousand. An effect at the cost-benefit threshold is unlikely to show up in significant changes to mortality rates. The intermediate health measures in this study are more sensitive to changes than mortality rate, but were they sensitive enough? Has anyone run the numbers on how sensitive they’d need to be in order to find an effect of this size? The point estimates that they did report are (relative to control group) an 8% reduction in number of people with elevated blood pressure, 17% reduction in number of people with high cholesterol, and 18% reduction in number of people with high glycated hemoglobin levels (a marker of diabetes), which intuitively seem big enough to be part of an across-the-board health improvement that passes cost-benefit muster.
which intuitively seem big enough to be part of an across-the-board health improvement that passes cost-benefit muster.
This would be much more convincing if you reported the costs along with the benefits, so that one could form some kind of estimate of what you’re willing to pay for this. But, again, I think your argument is motivated. “Consistent with zero” means just that; it means that the study cannot exclude the possibility that the intervention was actively harmful, but they had a random fluctuation in the data.
I get the impression that people here talk a good game about statistics, but haven’t really internalised the concept of error bars. I suggest that you have another look at why physics requires five sigma. There are really good reasons for that, you know; all the more so in a mindkilling-charged field.
I was responding to the suggestion that, even if the effects that they found are real, they are too small to matter. To me, that line of reasoning is a cue to do a Fermi estimate to get a quantitative sense of how big the effect would need to be in order to matter, and how that compares to the empirical results.
I didn’t get into a full-fledged Fermi estimate here (translating the measures that they used into the dollar value of the health benefits), which is hard to do that when they only collected data on a few intermediate health measures. (If anyone else has given it a shot, I’d like to take a look.) I did find a couple effect-size-related numbers for which I feel like I have some intuitive sense of their size, and they suggest that that line of reasoning does not go through. Effects that are big enough to matter relative to the costs of additional health spending (like 3 lives saved in their sample, or some equivalent benefit) seem small enough to avoid statistical significance, and the point estimates that they found which are not statistically significant (8-18% reductions in various metrics) seem large enough to matter.
My overall conclusion about the study (based on what I know about it so far) is that it provides little information for updating in any direction, because of those wide error bars. The results are consistent with Medicaid having no effect, they’re consistent with Medicaid having a modest health benefit (e.g., 10% reduction in a few bad things), they’re consistent with Medicaid being actively harmful, and they’re consistent with Medicaid having a large benefit (e.g., 40% reduction in many bad things). The likelihood ratios that the data provide for distinguishing between those alternatives are fairly close to one, with “modest health benefit” slightly favored over the more extreme alternatives.
Again, the original point McArdle is making is that “consistent with zero” is just completely not what the proponents expected beforehand, and they should update accordingly. See my discussion with TheOtherDave, below. A small effect may, indeed, be worth pursuing. But here we have a case where something fairly costly was done after much disagreement, and the proponents claimed that there would be a large effect. In that case, if you find a small effect, you ought not to say “Well, it’s still worth doing”; that’s not what you said before. It was claimed that there would be a large effect, and the program was passed on this basis. It is then dishonest to turn around and say “Ok, the effect is small but still worthwhile”. This ignores the inertia of political programs.
Most Medicaid proponents did not have expectations about the statistical results of this particular study. They did not make predictions about confidence intervals and p values for these particular analyses. Rather, they had expectations about the actual benefit of Medicaid.
You cite Ezra Klein as someone who expected that Medicaid would drastically reduce mortality; Klein was drawing his numbers from a report which estimated that in the US “137,000 people died from 2000 through 2006 because they lacked health insurance, including 22,000 people in 2006.” There were 47 million uninsured Americans in 2006, so those 22,000 excess deaths translate into 4.7 excess deaths per 10,000 uninsured people each year. So that’s the size of the drastic reduction in mortality that you’re referring to: 4.7 lives per 10,000 people each year. (For comparison, in my other comment I estimated that the Medicaid expansion would be worth its estimated cost if it saved at least 1.5 lives per 10,000 people each year or provided an equivalent benefit.)
Did the study rule out an effect as large as this drastic reduction of 4.7 per 10,000? As far as I can tell it did not (I’d like to see a more technical analysis of this). There were under 10,000 people in the study, so I wouldn’t be surprised if they missed effects of that size. Their point estimates, of an 8-18% reduction in various bad things, intuitively seem like they could be consistent with an effect that size. And the upper bounds of their confidence intervals (a 40%+ reduction in each of the 3 bad things) intuitively seem consistent with a much larger effect. So if people like Klein and Drum had made predictions in advance about the effect size of the Oregon intervention, I suspect that their predictions would have fallen within the study’s confidence interval.
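A rough power check of exactly that question, using my own assumptions (not the study's): ~5,000 per arm, two years of follow-up, and 0.8% annual baseline mortality:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Assumed: ~5,000 per arm followed for 2 years, 0.8%/yr baseline
# mortality, and the hypothesized effect of 4.7 fewer deaths per
# 10,000 people per year (9.4 per 10,000 over the 2 years).
n = 5000
p1 = 0.016                    # 2-year control mortality (assumed)
p2 = p1 - 2 * 4.7 / 10_000    # 2-year mortality with the effect
se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
power = norm_cdf(abs(p1 - p2) / se - 1.96)
print(f"power ≈ {power:.0%}")   # roughly 6%: almost certain to miss it
```

At this effect size the study is nearly guaranteed to come up non-significant on mortality, so a null result tells us almost nothing about the 4.7-per-10,000 hypothesis.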
There are presumably some people who did expect the results of the study to be statistically significant (otherwise, why run the study?), and they were wrong. But this isn’t a competition between opponents and proponents where every slipup by one side cedes territory to the other side. The data and results are there for us to look at, so we can update based on what the study actually found instead of on which side of the conflict fought better in this battle. In this case, it looks like the correct update based on the study (for most people, to a first approximation) is to not update at all. The confidence interval for the effects that they examined covers the full range of results that seemed plausible beforehand (including the no-effect-whatsoever hypothesis and the tens-of-thousands-of-lives-each-year hypothesis), so the study provides little information for updating one’s priors about the effectiveness of Medicaid.
For the people who did make the erroneous prediction that the study would find statistically significant results, why did they get it wrong? I’m not sure. A few possibilities: 1) they didn’t do an analysis of the study’s statistical power (or used some crude & mistaken heuristic to estimate power), 2) they overestimated how large a health benefit Medicaid would produce, 3) the control group in Oregon turned out to be healthier than they expected which left less room for Medicaid to show benefits, 4) fewer members of the experimental group than they expected ended up actually receiving Medicaid, which reduced the actual sample size and also added noise to the intent-to-treat analysis (reducing the effective sample size).
I do want to point out that, while I agree with your general points, I think that unless the proponents put numerical estimates up beforehand, it’s not quite fair to assume they meant “it will be statistically significant in a sample size of N at least 95% of the time.” Even if they said that, unless they explicitly calculated N, they probably underestimated it by at least one order of magnitude. (Professional researchers in social science make this mistake very frequently, and even when they avoid it, they can only very rarely find funding to actually collect N samples.)
I haven’t looked into this study in depth, so semi-related anecdote time: there was recently a study of calorie restriction in monkeys which had ~70 monkeys. The confidence interval for the hazard ratio included 1 (no effect), and so they concluded no statistically significant benefit to CR on mortality, though they could declare statistically significant benefit on a few varieties of mortality and several health proxies.
I ran the numbers to determine the power; turns out that they couldn’t have reliably noticed the effects of smoking (hazard ratio ~2) on longevity with a study of ~70 monkeys, and while I haven’t seen many quoted estimates of the hazard ratio of eating normally compared to CR, I don’t think there are many people that put them higher than 2.
When you don’t have the power to reliably conclude that all-cause mortality decreased, you can eke out some extra information by looking at the signs of all the proxies you measured. If insurance does nothing, we should expect to see the effect estimates scattered around 0. If insurance has a positive effect, we should expect to see more effect estimates above 0 than below 0, even though most will include 0 in their CI. (Suppose they measure 30 mortality proxies, and all of them show a positive effect, though the univariate CI includes 0 for all of them. If the ground truth was no effect on mortality proxies, that’s a very unlikely result to see; if the ground truth was a positive effect on mortality proxies, that’s a likely result to see.)
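The sign-test idea can be made exact, under the (unrealistic) assumption that the proxies are independent:

```python
from math import comb

def sign_test_p(k, n):
    """One-sided P(X >= k) for X ~ Binomial(n, 1/2): the probability of
    seeing at least k positive point estimates out of n if the true
    effect is zero and the proxies are independent."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

print(f"{sign_test_p(30, 30):.1e}")   # all 30 proxies positive
print(f"{sign_test_p(22, 30):.4f}")   # 22 of 30 proxies positive
```

Even 22 of 30 positive estimates, none individually significant, has under a 1% probability under the no-effect hypothesis. Real proxies are correlated, so the actual evidence is weaker than these numbers suggest, but the direction of the argument stands.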
I ran the numbers to determine the power; turns out that they couldn’t have reliably noticed the effects of smoking (hazard ratio ~2) on longevity with a study of ~70 monkeys, and while I haven’t seen many quoted estimates of the hazard ratio of eating normally compared to CR, I don’t think there are many people that put them higher than 2.
If I remember correctly, I noticed that an effect which did give a p of slightly less than .05 was a hazard ratio of 3, which made me think of running that test, and then I think spower was the R function that I used to figure out what power they could get for a hazard ratio of 2 and 35 experimentals and 35 controls (or whatever the actual split was; I think it was slightly different?).
So you were using Hmisc::spower… I’m surprised that there was even such a function (however obtusely named) - why on earth isn’t it in the survival library?
I was going to try to replicate that estimate, but looking at the spower documentation, it’s pretty complex and I don’t think I could do it without the original paper (which is more work than I want to do).
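Short of rerunning spower, Schoenfeld's event-count approximation gives a quick sanity check. This is a textbook formula, not a replication of the paper's or the spower analysis:

```python
from math import log

def events_needed(hr, z_alpha=1.96, z_beta=0.8416):
    """Schoenfeld's approximation: total deaths needed for a logrank
    test to detect hazard ratio `hr` with 80% power at two-sided
    alpha = 0.05, assuming 1:1 allocation."""
    return 4 * (z_alpha + z_beta) ** 2 / log(hr) ** 2

print(f"deaths needed for HR 2:   {events_needed(2):.0f}")
print(f"deaths needed for HR 1.5: {events_needed(1.5):.0f}")
```

You need about 65 deaths to reliably detect a hazard ratio of 2, and roughly three times that for 1.5; with ~70 monkeys total and many still alive at analysis, the observed death count falls well short, which is consistent with the underpowered-for-smoking conclusion above.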
It is of course very difficult to extract any precise numbers from a political discussion. :) However, if you click through some of the links in the article, or have a look at the followup from today, you’ll find McArdle quoting predictions of tens of thousands of preventable deaths yearly from non-insured status. That looks to me like a pretty big hazard rate, no?
you’ll find McArdle quoting predictions of tens of thousands of preventable deaths yearly from non-insured status. That looks to me like a pretty big hazard rate, no?
No. The Oracle says there’re about 50 million Americans without health insurance. The predictions you quoted refer to 18,000 or 27,000 deaths for want of insurance per year. The higher number implies only a 0.054% death rate per year, or a 3.5% death rate over 65 years (Americans over 65 automatically get insurance). This is non-negligible but hardly huge (and potentially important for all that).
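Spelling out that arithmetic (with constant-rate compounding over the 65 years):

```python
# Figures from the comment: 27,000 deaths/yr among ~50M uninsured.
annual_rate = 27_000 / 50_000_000
cumulative_65yr = 1 - (1 - annual_rate) ** 65   # assumes a constant rate
print(f"{annual_rate:.3%} per year, {cumulative_65yr:.2%} over 65 years")
```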
The higher number implies only a 0.054% death rate per year
Eyeballing the statistics, that looks like a hazard ratio between 1.1 and 1.5 (lots of things are good predictors for mortality that you would want to control for that I haven’t; the more you add, the closer that number should get to 1.1).
If you throw a die ten times and it comes up heads six, do you “hesitantly update a very tiny bit” in the direction of the coin being biased?
If I throw a die once and it comes up heads I’m going to be confused. Now, assuming you meant “toss a coin and it comes up heads six times out of ten”.
What is your intended ‘correct’ answer to the question? I think I would indeed hesitantly update a very (very) tiny bit in the direction of the coin being biased but different priors regarding the possibility of the coin being biased in various ways and degrees could easily make the update be towards not-biased. I’d significantly lower p(the coin is biased by having two heads) but very slightly raise p(the coin is slightly heavier on the tails side), etc.
My intended correct answer is that, on this data, you technically can adjust your belief very slightly; but because the prior for a biased coin is so tiny, the update is not worth doing. The calculation cost way exceeds any benefit you can get from gruel this thin. I would say “Null hypothesis [ie unbiased coin] not disconfirmed; move along, nothing to see here”. And if you had a political reason for wishing the coin to be biased towards heads, then you should definitely not make any such update; because you certainly wouldn’t have done so, if tails had come up six times. In that case it would immediately have been “P-level is in the double digits” and “no statistical significance means exactly that” and “with those errors we’re still consistent with a heads bias”.
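One way to formalize "the update is not worth doing": compare the fair-coin likelihood of 6 heads in 10 against a maximally agnostic alternative, a uniform prior over the coin's bias. The Bayes factor actually comes out in favor of the fair coin:

```python
from math import comb

# Under a uniform prior on the bias, P(k heads in n tosses) integrates
# to 1/(n+1) for every k, so the marginal likelihood is easy.
n, k = 10, 6
p_fair = comb(n, k) * 0.5 ** n   # likelihood under a fair coin
p_biased = 1 / (n + 1)           # marginal likelihood under uniform bias
bf = p_fair / p_biased
print(f"Bayes factor (fair : biased) ≈ {bf:.2f}")
```

So under this (admittedly convenient) alternative, 6 heads in 10 is mild evidence for fairness, not against it; only a prior already concentrated on a heads-bias turns the same data into support for bias.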
My intended correct answer is that, on this data, you technically can adjust your belief very slightly; but because the prior for a biased coin is so tiny, the update is not worth doing
I would think that our prior for “health care improves health” should be quite a bit larger than the prior for a coin to be biased.
Hanson’s point is that we often over-treat to show we care- not that 0 health care is optimal. Medicaid patients don’t really have to worry about overtreatment.
Hanson’s point is that we often over-treat to show we care- not that 0 health care is optimal
I was interpreting “health care improves health” as “healthcare improves health on the margin.” Is this not what was meant?
Medicaid patients don’t really have to worry about overtreatment.
As someone who has a start-up in the healthcare industry, this runs counter to my personal experience. Also, currently “medicaid overtreatment” is showing about 676,000 results on Google (while “medicaid undertreatment” is showing about 1,240,000 results). Even if it isn’t typical, it surely isn’t an unheard-of phenomenon.
I was interpreting “health care improves health” as “healthcare improves health on the margin.” Is this not what was meant?
No, I meant going from 0 access to care to some access to care improves health, as we are discussing the medicaid study comparing people on medicaid to the uninsured.
As someone who has a start-up in the healthcare industry, this runs counter to my personal experience.
I currently work as a statistician for a large HMO, and I can tell you for us, medicaid patients generally get the ‘patch-you-up-and-out-the-door’ treatment because odds are high we won’t be getting reimbursed in any kind of timely fashion. I’ve worked in a few states, and it seems pretty common for medicaid to be fairly underfunded (hence the Oregon study we are discussing).
And generally, providing medicaid is moving someone from emergency-only to some-primary-care, which is where we should expect some impact: this isn’t increasing treatment on the margin, it’s providing minimal care to a largely untreated population.
Currently, “medicaid overtreatment” is showing about 676,000 results on Google
So I randomly sampled ~5 in the first two pages, and 3 of those were articles about overtreatment that had a sidebar to a different article discussing some aspect of medicaid, so I’m not sure if the count is meaningful here. (The other 2 were about some loophole dentists were using to overtreat children on medicaid and bill extra, I have no knowledge of dental claims).
No, I meant going from 0 access to care to some access to care improves health, as we are discussing the medicaid study comparing people on medicaid to the uninsured.
This does not appear to be the actual change in access to care when going from being uninsured to on medicaid. As you mention, uninsured patients receive emergency-only care.
Such a study might show that it doesn’t matter on average. But you’d need those numbers to see if it’s increasing the spread of values. That would mean that it really helps some and hurts others. If you can figure out which is which, then it’ll end up being useful. Heck, this applies even if the average effect is negative.
I don’t know how often bio-researchers treat the standard deviation as part of their signal. I suspect it’s infrequent.
How large was your prior for “insurance helps some and harms others, and we should try to figure out which is which” before that was one possible way of rescuing insurance from this study? That sort of argument is, I respectfully suggest, a warning signal which should make you consider whether your bottom line is already written.
I wasn’t even thinking of insurance here. You were talking about garlic. I was thinking about my physics experiments where the standard deviation is a very useful channel of information.
If the effect is so small that a sample of several thousand is not sufficient to reliably observe it, then it doesn’t even matter that it is positive. An analogy: Suppose I tell you that eating garlic daily increases your IQ, and point to a study with three million participants and P < 1e-7. Vastly significant, no? Now it turns out that the actual size of the effect is 0.01 points of IQ. Are you going to start eating garlic? What if it weren’t garlic, but a several-billion-dollar government health program? Statistical significance is indeed not everything, but there’s such a thing as considering the size of an effect, especially if there’s a cost involved.
Moreover, please consider that “consistent with zero” means exactly that. If you throw a die ten times and it comes up heads six, do you “hesitantly update a very tiny bit” in the direction of the coin being biased? Would you do so, if you did not have a prior reason to hope that the coin was biased?
I respectfully suggest that you are letting your already-written bottom line interfere with your math.
If I throw a die and it comes up heads, I’d update in the direction of it being a very unusual die. :-)
I strongly disagree.
An old comment of mine gives us a counterexample. A couple of years ago, a meta-analysis of RCTs found that taking aspirin daily reduces the risk of dying from cancer by ~20% in middle-aged and older adults. This is very much a practically significant effect, and it’s probably an underestimate for reasons I’ll omit for brevity — look at the paper if you’re curious.
If you do look at the paper, notice figure 1, which summarizes the results of the 8 individual RCTs the meta-analysis used. Even though all of the RCTs had sample sizes in the thousands, 7 of them failed to show a statistically significant effect, including the 4 largest (sample sizes 5139, 5085, 3711 & 3310). The effect is therefore “so small that a sample of several thousand is not sufficient to reliably observe it”, but we would be absolutely wrong to infer that “it doesn’t even matter that it is positive”!
The heuristic that a hard-to-detect effect is probably too small to care about is a fair rule of thumb, but it’s only a heuristic. EHeller & Unnamed are quite right to point out that statistical significance and practical significance correlate only imperfectly.
tl;dr: NHST and Bayesian-style subjective probability do not mix easily.
Another example of this problem: http://slatestarcodex.com/2014/01/25/beware-mass-produced-medical-recommendations/
Does vitamin D reduce all-cause mortality in the elderly? The point-estimates from pretty much all of the various studies are around a 5% reduction in risk of dying for any reason—pretty nontrivial, one would say, no? Yet the results are almost all not ‘statistically significant’! So do we follow Rolf and say ‘fans of vitamin D ought to update on vitamin D not helping overall’… or do we, applying power considerations about the likelihood of making the hard cutoffs at p<0.05 given the small sample sizes & plausible effect sizes, note that the point-estimates are in favor of the hypothesis? (And how does this interact with two-sided tests—vitamin D could’ve increased mortality, after all. Positive point-estimates are consistent with vitamin D helping, and less consistent with no effect, and even less consistent with it harming; so why are we supposed to update in favor of no help or harm when we see a positive point-estimate?)
If we accept Rolf’s argument, then we’d be in the odd position of, as we read through one non-statistically-significant study after another, decreasing the probability of ‘non-zero reduction in mortality’… right up until we get the Autier or Cochrane data summarizing the exact same studies & plug it into a Bayesian meta-analysis like Salvatier did & abruptly flip to ’92% chance of non-zero reduction in mortality’.
That’s a curious metric to choose. By that standard taking aspirin is about as healthy as playing a round of Russian Roulette.
It’s a fairly natural metric to choose if one wishes to gauge aspirin’s effect on cancer risk, as the study’s authors did.
Fortunately, the study’s authors and I also interpreted the data by another standard. Daily aspirin reduced all-cause mortality, and didn’t increase non-cancer deaths (except for “a transient increase in risk of vascular death in the aspirin groups during the first year after completion of the trials”). These are not results we would see if aspirin effected its anti-cancer magic by a similar mechanism to Russian Roulette.
Pardon me. Mentioning only curiosity was politeness. The more significant meanings I would supplement with are ‘naive or suspicious’. By itself that metric really is worthless and reading this kind of health claim should set off warning bells. Lost purposes are a big problem when it comes to medicine. Partly because it is hard, mostly because there is more money in the area than nearly anywhere else.
And this is the reason low dose asprin is part of my daily supplement regime (while statins are not).
“All cause mortality” is a magical phrase.
I recently stopped with the low dose aspirin, the bleeding when I accidentally cut myself has proven to be too much of an inconvenience. For the time being, at least.
I’d assume they mean something like the per-year risk of dying from cancer conditional on previous survival—if they indeed mean the total lifetime risk of dying from cancer I agree it’s ridiculous.
Am I missing a subtlety here, or is it just that cancer is usually one of those things that you hope to live long enough to get?
Yeah, pretty much. There are other examples of this where something harmful appears to be helpful when you don’t take into account possible selection biases (like being put into the ‘non-cancer death’ category); for example, this is an issue in smoking—you can find various correlations where smokers are healthier than non-smokers, but this is just because the unhealthier smokers got pushed over the edge by smoking and died earlier.
Have you read the study in question? The treatment sample is NOT several thousand; it's about 1,500. Further, the incidence of each of the diseases being looked at is only a few percent or less, so the treatment sample sizes for the most prevalent diseases are around 50 (also, if you look at the specifics of the sample, the diseased groups are pretty well controlled).
I suggest the following exercise: ask yourself what WOULD be a big effect, and then work through whether the study has the power to see it.
Yes, but in this case, the sample sizes are small and the error bars are so large that "consistent with zero" is ALSO consistent with a 25+ % reduction in incidence (which would be a large intervention). The study is incapable of distinguishing a hugely important effect from zero effect, so we shouldn't update much at all, which is why I wished McArdle had talked about statistical power. Before we ask "how should we update", we should ask "what information is actually here?"
Edit: If we treat this as an exploration, it says "we need another study": after all, the effects could be as large as 40%! That's a potentially tremendous intervention. Unfortunately, it's unethical to randomly boot people off of insurance, so we'll likely never see that study done.
Health is extremely important—the statistical value of a human life is something like $8 million—so smallish looking effects can be practically relevant. An intervention that saves 1 life out of every 10,000 people treated has an average benefit of $800 per person. In this Oregon study, people who received Medicaid cost an extra $1,172 per year in total health spending, so the intervention would need to save 1.5 lives per 10,000 person-years (or provide an equivalent benefit in other health improvements) for the health benefits to balance out the health costs. The study looked at fewer than 10,000 people over 2 years, so the cost-benefit cutoff for whether it’s worth it is less than 3 lives saved (or equivalent).
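The break-even arithmetic in the paragraph above can be sketched in a few lines (a minimal Fermi check, using the comment's own assumed figures of ~$8 million per statistical life and ~$1,172 extra spending per person-year):

```python
# Fermi estimate: how big a mortality benefit would Medicaid need to
# provide to offset its extra cost? Figures are the comment's assumptions.
VALUE_OF_LIFE = 8_000_000   # dollars, rough statistical value of a life
EXTRA_COST = 1_172          # dollars of extra health spending per person-year

# Break-even benefit: lives saved per 10,000 person-years of coverage
lives_per_10k = EXTRA_COST / VALUE_OF_LIFE * 10_000
print(f"break-even: about {lives_per_10k:.2f} lives per 10,000 person-years")
```

This recovers the ~1.5 lives per 10,000 person-years figure quoted above; any equivalent non-mortality health benefit would count the same way.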
So “not statistically significant” does not imply unimportant, even with a sample size of several thousand. An effect at the cost-benefit threshold is unlikely to show up in significant changes to mortality rates. The intermediate health measures in this study are more sensitive to changes than mortality rate, but were they sensitive enough? Has anyone run the numbers on how sensitive they’d need to be in order to find an effect of this size? The point estimates that they did report are (relative to control group) an 8% reduction in number of people with elevated blood pressure, 17% reduction in number of people with high cholesterol, and 18% reduction in number of people with high glycated hemoglobin levels (a marker of diabetes), which intuitively seem big enough to be part of an across-the-board health improvement that passes cost-benefit muster.
This would be much more convincing if you reported the costs along with the benefits, so that one could form some kind of estimate of what you’re willing to pay for this. But, again, I think your argument is motivated. “Consistent with zero” means just that; it means that the study cannot exclude the possibility that the intervention was actively harmful, but they had a random fluctuation in the data.
I get the impression that people here talk a good game about statistics, but haven’t really internalised the concept of error bars. I suggest that you have another look at why physics requires five sigma. There are really good reasons for that, you know; all the more so in a mindkilling-charged field.
I was responding to the suggestion that, even if the effects that they found are real, they are too small to matter. To me, that line of reasoning is a cue to do a Fermi estimate to get a quantitative sense of how big the effect would need to be in order to matter, and how that compares to the empirical results.
I didn’t get into a full-fledged Fermi estimate here (translating the measures that they used into the dollar value of the health benefits), which is hard to do when they only collected data on a few intermediate health measures. (If anyone else has given it a shot, I’d like to take a look.) I did find a couple of effect-size-related numbers for which I feel like I have some intuitive sense of their size, and they suggest that that line of reasoning does not go through. Effects that are big enough to matter relative to the costs of additional health spending (like 3 lives saved in their sample, or some equivalent benefit) seem small enough to avoid statistical significance, and the point estimates that they found which are not statistically significant (8-18% reductions in various metrics) seem large enough to matter.
My overall conclusion about the study (based on what I know about it so far) is that it provides little information for updating in any direction, because of those wide error bars. The results are consistent with Medicaid having no effect, they’re consistent with Medicaid having a modest health benefit (e.g., 10% reduction in a few bad things), they’re consistent with Medicaid being actively harmful, and they’re consistent with Medicaid having a large benefit (e.g., 40% reduction in many bad things). The likelihood ratios that the data provide for distinguishing between those alternatives are fairly close to one, with “modest health benefit” slightly favored over the more extreme alternatives.
Again, the original point McArdle is making is that “consistent with zero” is just completely not what the proponents expected beforehand, and they should update accordingly. See my discussion with TheOtherDave, below. A small effect may, indeed, be worth pursuing. But here we have a case where something fairly costly was done after much disagreement, and the proponents claimed that there would be a large effect. In that case, if you find a small effect, you ought not to say “Well, it’s still worth doing”; that’s not what you said before. It was claimed that there would be a large effect, and the program was passed on this basis. It is then dishonest to turn around and say “Ok, the effect is small but still worthwhile”. This ignores the inertia of political programs.
Most Medicaid proponents did not have expectations about the statistical results of this particular study. They did not make predictions about confidence intervals and p values for these particular analyses. Rather, they had expectations about the actual benefit of Medicaid.
You cite Ezra Klein as someone who expected that Medicaid would drastically reduce mortality; Klein was drawing his numbers from a report which estimated that in the US “137,000 people died from 2000 through 2006 because they lacked health insurance, including 22,000 people in 2006.” There were 47 million uninsured Americans in 2006, so those 22,000 excess deaths translate into 4.7 excess deaths per 10,000 uninsured people each year. So that’s the size of the drastic reduction in mortality that you’re referring to: 4.7 lives per 10,000 people each year. (For comparison, in my other comment I estimated that the Medicaid expansion would be worth its estimated cost if it saved at least 1.5 lives per 10,000 people each year or provided an equivalent benefit.)
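The conversion in the paragraph above is simple enough to verify directly (using the report's figures of 22,000 excess deaths in 2006 and 47 million uninsured):

```python
# Sanity check: 22,000 excess deaths per year among 47 million uninsured
# Americans, expressed as excess deaths per 10,000 people per year.
excess_deaths = 22_000
uninsured = 47_000_000

rate_per_10k = excess_deaths / uninsured * 10_000
print(f"{rate_per_10k:.1f} excess deaths per 10,000 uninsured per year")
```

This reproduces the 4.7 per 10,000 figure, which can then be compared against the ~1.5 per 10,000 break-even estimate quoted from the other comment.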
Did the study rule out an effect as large as this drastic reduction of 4.7 per 10,000? As far as I can tell it did not (I’d like to see a more technical analysis of this). There were under 10,000 people in the study, so I wouldn’t be surprised if they missed effects of that size. Their point estimates, of an 8-18% reduction in various bad things, intuitively seem like they could be consistent with an effect that size. And the upper bounds of their confidence intervals (a 40%+ reduction in each of the 3 bad things) intuitively seem consistent with a much larger effect. So if people like Klein and Drum had made predictions in advance about the effect size of the Oregon intervention, I suspect that their predictions would have fallen within the study’s confidence interval.
There are presumably some people who did expect the results of the study to be statistically significant (otherwise, why run the study?), and they were wrong. But this isn’t a competition between opponents and proponents where every slipup by one side cedes territory to the other side. The data and results are there for us to look at, so we can update based on what the study actually found instead of on which side of the conflict fought better in this battle. In this case, it looks like the correct update based on the study (for most people, to a first approximation) is to not update at all. The confidence interval for the effects that they examined covers the full range of results that seemed plausible beforehand (including the no-effect-whatsoever hypothesis and the tens-of-thousands-of-lives-each-year hypothesis), so the study provides little information for updating one’s priors about the effectiveness of Medicaid.
For the people who did make the erroneous prediction that the study would find statistically significant results, why did they get it wrong? I’m not sure. A few possibilities: 1) they didn’t do an analysis of the study’s statistical power (or used some crude & mistaken heuristic to estimate power), 2) they overestimated how large a health benefit Medicaid would produce, 3) the control group in Oregon turned out to be healthier than they expected which left less room for Medicaid to show benefits, 4) fewer members of the experimental group than they expected ended up actually receiving Medicaid, which reduced the actual sample size and also added noise to the intent-to-treat analysis (reducing the effective sample size).
I do want to point out that, while I agree with your general points, I think that unless the proponents put numerical estimates up beforehand, it’s not quite fair to assume they meant “it will be statistically significant in a sample size of N at least 95% of the time.” Even if they said that, unless they explicitly calculated N, they probably underestimated it by at least one order of magnitude. (Professional researchers in social science make this mistake very frequently, and even when they avoid it, they can only very rarely find funding to actually collect N samples.)
I haven’t looked into this study in depth, so semi-related anecdote time: there was recently a study of calorie restriction in monkeys which had ~70 monkeys. The confidence interval for the hazard ratio included 1 (no effect), and so they concluded no statistically significant benefit to CR on mortality, though they could declare statistically significant benefit on a few varieties of mortality and several health proxies.
I ran the numbers to determine the power; turns out that they couldn’t have reliably noticed the effects of smoking (hazard ratio ~2) on longevity with a study of ~70 monkeys, and while I haven’t seen many quoted estimates of the hazard ratio of eating normally compared to CR, I don’t think there are many people that put them higher than 2.
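As a rough cross-check on that power claim (a sketch using Schoenfeld's approximation for a log-rank test, not the original spower simulation, and assuming two-sided α = 0.05 with 80% power and 1:1 allocation):

```python
import math

# Crude cross-check (Schoenfeld's approximation, NOT the original
# Hmisc::spower simulation): number of deaths needed to detect a given
# hazard ratio, assuming two-sided alpha = 0.05, 80% power, 1:1 allocation.
def events_needed(hazard_ratio, z_alpha=1.96, z_beta=0.84, p=0.5):
    # p is the fraction allocated to the treatment arm
    return (z_alpha + z_beta) ** 2 / (p * (1 - p) * math.log(hazard_ratio) ** 2)

# A smoking-sized effect (HR ~ 2) needs roughly 65 observed deaths; a
# ~70-monkey study in which many animals are still alive falls short.
print(round(events_needed(2.0)))
```

Note that what matters for survival analyses is the number of deaths observed, not the number of subjects enrolled, which makes a 70-monkey study even weaker than it first appears.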
When you don’t have the power to reliably conclude that all-cause mortality decreased, you can eke out some extra information by looking at the signs of all the proxies you measured. If insurance does nothing, we should expect to see the effect estimates scattered around 0. If insurance has a positive effect, we should expect to see more effect estimates above 0 than below 0, even though most will include 0 in their CI. (Suppose they measure 30 mortality proxies, and all of them show a positive effect, though the univariate CI includes 0 for all of them. If the ground truth was no effect on mortality proxies, that’s a very unlikely result to see; if the ground truth was a positive effect on mortality proxies, that’s a likely result to see.)
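The sign-test intuition in the parenthetical above is easy to make concrete (30 proxies is the comment's hypothetical number, not a figure from the study):

```python
# If insurance truly has zero effect, each measured proxy is equally
# likely to show a positive or negative point estimate. The chance that
# all 30 hypothetical proxies come out positive under that null is:
n_proxies = 30
p_all_positive = 0.5 ** n_proxies
print(p_all_positive)  # ~9.3e-10
```

So a consistent pattern of same-signed point estimates can be strong evidence even when every individual confidence interval includes zero.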
Incidentally, how did you do that?
If I remember correctly, I noticed that an effect which did give a p of slightly less than .05 was a hazard ratio of 3, which made me think of running that test; I then used the R function spower to figure out what p they could get for a hazard ratio of 2 with 35 experimentals and 35 controls (or whatever the actual split was; I think it was slightly different).
So you were using Hmisc::spower… I’m surprised that there was even such a function (however obtusely named) - why on earth isn’t it in the survival library?
I was going to try to replicate that estimate, but looking at the spower documentation, it’s pretty complex and I don’t think I could do it without the original paper (which is more work than I want to do).
It is of course very difficult to extract any precise numbers from a political discussion. :) However, if you click through some of the links in the article, or have a look at the followup from today, you’ll find McArdle quoting predictions of tens of thousands of preventable deaths yearly from non-insured status. That looks to me like a pretty big hazard rate, no?
No. The Oracle says there are about 50 million Americans without health insurance. The predictions you quoted refer to 18,000 or 27,000 deaths for want of insurance per year. The higher number implies an excess death rate of only 0.054% per year, or about a 3.5% cumulative risk over 65 years (Americans over 65 automatically get insurance). This is non-negligible but hardly huge (and potentially important for all that).
Edit: and I see gwern has whupped me here.
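The arithmetic in that comment checks out (27,000 deaths per year among ~50 million uninsured, compounded over a 65-year uninsured span):

```python
# Check: annual excess death rate from lack of insurance, and the
# cumulative risk over 65 years of being uninsured.
annual_rate = 27_000 / 50_000_000           # per-year excess death rate
cumulative = 1 - (1 - annual_rate) ** 65    # cumulative risk over 65 years
print(f"{annual_rate:.3%} per year; {cumulative:.1%} over 65 years")
```

The simple-compounding form gives ~3.45%, essentially the same as the 0.054% × 65 ≈ 3.5% back-of-envelope figure.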
Eyeballing the statistics, that looks like a hazard ratio between 1.1 and 1.5 (there are lots of good predictors of mortality that you would want to control for, which I haven’t; the more you add, the closer that number should get to 1.1).
It looks like you’re referring to a hazard ratio or maybe a relative risk, neither of which are the same as a “hazard rate” AFAIK.
You’re right; I’m thinking of hazard ratios. Editing.
Over a population of something like 50 million people? Dunno.
If I throw a die once and it comes up heads I’m going to be confused. Now, assuming you meant “toss a coin and it comes up heads six times out of ten”.
What is your intended ‘correct’ answer to the question? I think I would indeed hesitantly update a very (very) tiny bit in the direction of the coin being biased but different priors regarding the possibility of the coin being biased in various ways and degrees could easily make the update be towards not-biased. I’d significantly lower p(the coin is biased by having two heads) but very slightly raise p(the coin is slightly heavier on the tails side), etc.
My intended correct answer is that, on this data, you technically can adjust your belief very slightly; but because the prior for a biased coin is so tiny, the update is not worth doing. The calculation cost way exceeds any benefit you can get from gruel this thin. I would say “Null hypothesis [ie unbiased coin] not disconfirmed; move along, nothing to see here”. And if you had a political reason for wishing the coin to be biased towards heads, then you should definitely not make any such update; because you certainly wouldn’t have done so, if tails had come up six times. In that case it would immediately have been “P-level is in the double digits” and “no statistical significance means exactly that” and “with those errors we’re still consistent with a heads bias”.
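The size of that "very slight" adjustment can be quantified with a likelihood ratio (a minimal sketch; the p = 0.6 bias is an arbitrary hypothetical alternative, not anything from the thread):

```python
from math import comb

# Likelihood of 6 heads in 10 tosses under a fair coin vs. under a
# hypothetical coin biased to p = 0.6 heads.
def binom_lik(p, k=6, n=10):
    return comb(n, k) * p**k * (1 - p)**(n - k)

lik_fair = binom_lik(0.5)
lik_bias = binom_lik(0.6)
print(f"likelihood ratio: {lik_bias / lik_fair:.2f}")
```

The evidence favors the biased coin by only about 1.2 to 1 before the (tiny) prior on bias is even applied, which is why the update is barely worth computing.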
I would think that our prior for “health care improves health” should be quite a bit larger than the prior for a coin to be biased.
That depends on how long “we” have been reading Overcoming Bias.
Hanson’s point is that we often over-treat to show we care, not that zero health care is optimal. Medicaid patients don’t really have to worry about overtreatment.
I was interpreting “health care improves health” as “healthcare improves health on the margin.” Is this not what was meant?
As someone who has a start-up in the healthcare industry, this runs counter to my personal experience. Also, currently “medicaid overtreatment” is showing about 676,000 results on Google (while “medicaid undertreatment” is showing about 1,240,000 results). Even if it isn’t typical, it surely isn’t an unheard-of phenomenon.
No, I meant going from 0 access to care to some access to care improves health, as we are discussing the medicaid study comparing people on medicaid to the uninsured.
I currently work as a statistician for a large HMO, and I can tell you for us, medicaid patients generally get the ‘patch-you-up-and-out-the-door’ treatment because odds are high we won’t be getting reimbursed in any kind of timely fashion. I’ve worked in a few states, and it seems pretty common for medicaid to be fairly underfunded (hence the Oregon study we are discussing).
And generally, providing medicaid is moving someone from emergency-only care to some primary care, which is where we should expect some impact: this isn’t increasing treatment on the margin, it’s providing minimal care to a largely untreated population.
So I randomly sampled ~5 in the first two pages, and 3 of those were articles about overtreatment that had a sidebar to a different article discussing some aspect of medicaid, so I’m not sure if the count is meaningful here. (The other 2 were about some loophole dentists were using to overtreat children on medicaid and bill extra, I have no knowledge of dental claims).
This does not appear to be the actual change in access to care when going from being uninsured to on medicaid. As you mention, uninsured patients receive emergency-only care.
Such a study might show that it doesn’t matter on average. But you’d need those numbers to see if it’s increasing the spread of values. That would mean that it really helps some and hurts others. If you can figure out which is which, then it’ll end up being useful. Heck, this applies even if the average effect is negative.
I don’t know how often bio-researchers treat the standard deviation as part of their signal. I suspect it’s infrequent.
How large was your prior for “insurance helps some and harms others, and we should try to figure out which is which” before that was one possible way of rescuing insurance from this study? That sort of argument is, I respectfully suggest, a warning signal which should make you consider whether your bottom line is already written.
I wasn’t even thinking of insurance here. You were talking about garlic. I was thinking about my physics experiments where the standard deviation is a very useful channel of information.