which intuitively seem big enough to be part of an across-the-board health improvement that passes cost-benefit muster.
This would be much more convincing if you reported the costs along with the benefits, so that one could form some kind of estimate of what you’re willing to pay for this. But, again, I think your argument is motivated. “Consistent with zero” means just that; it means that the study cannot exclude the possibility that the intervention was actively harmful, but they had a random fluctuation in the data.
I get the impression that people here talk a good game about statistics, but haven’t really internalised the concept of error bars. I suggest that you have another look at why physics requires five sigma. There are really good reasons for that, you know; all the more so in a mindkilling-charged field.
I was responding to the suggestion that, even if the effects that they found are real, they are too small to matter. To me, that line of reasoning is a cue to do a Fermi estimate to get a quantitative sense of how big the effect would need to be in order to matter, and how that compares to the empirical results.
I didn’t get into a full-fledged Fermi estimate here (translating the measures that they used into the dollar value of the health benefits), which is hard to do when they only collected data on a few intermediate health measures. (If anyone else has given it a shot, I’d like to take a look.) I did find a couple of effect-size-related numbers for which I feel like I have some intuitive sense of their size, and they suggest that that line of reasoning does not go through. Effects that are big enough to matter relative to the costs of additional health spending (like 3 lives saved in their sample, or some equivalent benefit) seem small enough to avoid statistical significance, and the point estimates that they found which are not statistically significant (8-18% reductions in various metrics) seem large enough to matter.
My overall conclusion about the study (based on what I know about it so far) is that it provides little information for updating in any direction, because of those wide error bars. The results are consistent with Medicaid having no effect, they’re consistent with Medicaid having a modest health benefit (e.g., 10% reduction in a few bad things), they’re consistent with Medicaid being actively harmful, and they’re consistent with Medicaid having a large benefit (e.g., 40% reduction in many bad things). The likelihood ratios that the data provide for distinguishing between those alternatives are fairly close to one, with “modest health benefit” slightly favored over the more extreme alternatives.
Again, the original point McArdle is making is that “consistent with zero” is just completely not what the proponents expected beforehand, and they should update accordingly. See my discussion with TheOtherDave, below. A small effect may, indeed, be worth pursuing. But here we have a case where something fairly costly was done after much disagreement, and the proponents claimed that there would be a large effect. In that case, if you find a small effect, you ought not to say “Well, it’s still worth doing”; that’s not what you said before. It was claimed that there would be a large effect, and the program was passed on this basis. It is then dishonest to turn around and say “Ok, the effect is small but still worthwhile”. This ignores the inertia of political programs.
Most Medicaid proponents did not have expectations about the statistical results of this particular study. They did not make predictions about confidence intervals and p values for these particular analyses. Rather, they had expectations about the actual benefit of Medicaid.
You cite Ezra Klein as someone who expected that Medicaid would drastically reduce mortality; Klein was drawing his numbers from a report which estimated that in the US “137,000 people died from 2000 through 2006 because they lacked health insurance, including 22,000 people in 2006.” There were 47 million uninsured Americans in 2006, so those 22,000 excess deaths translate into 4.7 excess deaths per 10,000 uninsured people each year. So that’s the size of the drastic reduction in mortality that you’re referring to: 4.7 lives per 10,000 people each year. (For comparison, in my other comment I estimated that the Medicaid expansion would be worth its estimated cost if it saved at least 1.5 lives per 10,000 people each year or provided an equivalent benefit.)
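For anyone who wants to check the arithmetic, here is a minimal sketch of the per-10,000 conversion. The inputs are just the figures quoted above (22,000 excess deaths, 47 million uninsured, and the 1.5-per-10,000 break-even threshold from the other comment); the function and variable names are mine, purely for illustration.

```python
# Figures quoted in the comment above; nothing here comes from the study itself.
excess_deaths_2006 = 22_000
uninsured_2006 = 47_000_000
break_even_rate = 1.5  # lives per 10,000 per year, from the other comment


def per_10k(events, population):
    """Convert an annual event count into events per 10,000 people."""
    return events / population * 10_000


klein_rate = per_10k(excess_deaths_2006, uninsured_2006)

print(f"Implied excess mortality: {klein_rate:.1f} per 10,000 per year")
print(f"Ratio to break-even threshold: {klein_rate / break_even_rate:.1f}x")
```

So the "drastic" claim is roughly three times the break-even threshold, which is a much smaller gap than the rhetoric on either side suggests.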
Did the study rule out an effect as large as this drastic reduction of 4.7 per 10,000? As far as I can tell it did not (I’d like to see a more technical analysis of this). There were under 10,000 people in the study, so I wouldn’t be surprised if they missed effects of that size. Their point estimates, of an 8-18% reduction in various bad things, intuitively seem like they could be consistent with an effect that size. And the upper bounds of their confidence intervals (a 40%+ reduction in each of the 3 bad things) intuitively seem consistent with a much larger effect. So if people like Klein and Drum had made predictions in advance about the effect size of the Oregon intervention, I suspect that their predictions would have fallen within the study’s confidence interval.
There are presumably some people who did expect the results of the study to be statistically significant (otherwise, why run the study?), and they were wrong. But this isn’t a competition between opponents and proponents where every slipup by one side cedes territory to the other side. The data and results are there for us to look at, so we can update based on what the study actually found instead of on which side of the conflict fought better in this battle. In this case, it looks like the correct update based on the study (for most people, to a first approximation) is to not update at all. The confidence interval for the effects that they examined covers the full range of results that seemed plausible beforehand (including the no-effect-whatsoever hypothesis and the tens-of-thousands-of-lives-each-year hypothesis), so the study provides little information for updating one’s priors about the effectiveness of Medicaid.
For the people who did make the erroneous prediction that the study would find statistically significant results, why did they get it wrong? I’m not sure. A few possibilities: 1) they didn’t do an analysis of the study’s statistical power (or used some crude & mistaken heuristic to estimate power), 2) they overestimated how large a health benefit Medicaid would produce, 3) the control group in Oregon turned out to be healthier than they expected which left less room for Medicaid to show benefits, 4) fewer members of the experimental group than they expected ended up actually receiving Medicaid, which reduced the actual sample size and also added noise to the intent-to-treat analysis (reducing the effective sample size).
I do want to point out that, while I agree with your general points, I think that unless the proponents put numerical estimates up beforehand, it’s not quite fair to assume they meant “it will be statistically significant in a sample size of N at least 95% of the time.” Even if they said that, unless they explicitly calculated N, they probably underestimated it by at least one order of magnitude. (Professional researchers in social science make this mistake very frequently, and even when they avoid it, they can only very rarely find funding to actually collect N samples.)
I haven’t looked into this study in depth, so semi-related anecdote time: there was recently a study of calorie restriction in monkeys which had ~70 monkeys. The confidence interval for the hazard ratio included 1 (no effect), and so they concluded no statistically significant benefit to CR on mortality, though they could declare statistically significant benefit on a few varieties of mortality and several health proxies.
I ran the numbers to determine the power; turns out that they couldn’t have reliably noticed the effects of smoking (hazard ratio ~2) on longevity with a study of ~70 monkeys, and while I haven’t seen many quoted estimates of the hazard ratio of eating normally compared to CR, I don’t think there are many people that put them higher than 2.
When you don’t have the power to reliably conclude that all-cause mortality decreased, you can eke out some extra information by looking at the signs of all the proxies you measured. If insurance does nothing, we should expect to see the effect estimates scattered around 0. If insurance has a positive effect, we should expect to see more effect estimates above 0 than below 0, even though most will include 0 in their CI. (Suppose they measure 30 mortality proxies, and all of them show a positive effect, though the univariate CI includes 0 for all of them. If the ground truth was no effect on mortality proxies, that’s a very unlikely result to see; if the ground truth was a positive effect on mortality proxies, that’s a likely result to see.)
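To put a rough number on the parenthetical, here is a sketch of a one-sided sign test. It assumes that, under the no-effect hypothesis, each proxy's estimate is equally likely to land above or below 0, and that the proxies are independent. The independence assumption is almost certainly false for real health measures (they are correlated), so the true probability would be less extreme than this; the function name is my own.

```python
from math import comb


def sign_test_p(n_positive, n_total):
    """One-sided P(at least n_positive positive signs out of n_total),
    assuming each sign is an independent fair coin flip under the null."""
    favourable = sum(comb(n_total, k) for k in range(n_positive, n_total + 1))
    return favourable / 2 ** n_total


# The hypothetical from the comment: 30 proxies, all pointing the same way.
p_all_30 = sign_test_p(30, 30)
# A less extreme hypothetical: 20 of 30 proxies positive.
p_20_of_30 = sign_test_p(20, 30)

print(f"All 30 positive under no effect: {p_all_30:.2e}")
print(f"At least 20 of 30 positive under no effect: {p_20_of_30:.3f}")
```

Even the 20-of-30 case would be mildly surprising under no effect, although none of the individual confidence intervals would need to exclude 0.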
Incidentally, how did you do that?
If I remember correctly, I noticed that an effect that did give a p of slightly less than .05 had a hazard ratio of 3, which made me think of running that test, and then I think spower was the R function that I used to figure out what p they could get for a hazard ratio of 2 and 35 experimentals and 35 controls (or whatever the actual split was; I think it was slightly different?).
So you were using Hmisc::spower… I’m surprised that there was even such a function (however obtusely named) - why on earth isn’t it in the survival library?
I was going to try to replicate that estimate, but looking at the spower documentation, it’s pretty complex and I don’t think I could do it without the original paper (which is more work than I want to do).
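For a rough sanity check that doesn't need spower or the paper, one can swap in Schoenfeld's closed-form approximation for the power of a two-sided log-rank test with equal-sized arms. This is not what spower does (as I understand it, spower estimates power by simulation), and the death counts below are hypothetical placeholders, not figures from the monkey study. In this approximation power depends on the number of deaths observed, not the number of animals enrolled.

```python
from math import log, sqrt
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def logrank_power(hazard_ratio, n_events, alpha=0.05):
    """Approximate power of a two-sided log-rank test with equal arms
    (Schoenfeld's approximation), given the total number of observed deaths."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    return _norm.cdf(sqrt(n_events) / 2 * abs(log(hazard_ratio)) - z_alpha)


def events_needed(hazard_ratio, power=0.8, alpha=0.05):
    """Approximate number of deaths needed to reach the requested power."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_beta = _norm.inv_cdf(power)
    return 4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2


# Hypothetical death counts; the real number would have to come from the paper.
for deaths in (20, 35, 50, 70):
    print(f"HR=2, {deaths} deaths: power ~ {logrank_power(2.0, deaths):.2f}")

print(f"Deaths needed for 80% power at HR=2: ~{events_needed(2.0):.0f}")
```

Under these assumptions, a hazard ratio of 2 needs on the order of 65 deaths for 80% power, so a study of ~70 animals in which many are still alive would be underpowered, which is at least in the same ballpark as the claim above.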
It is of course very difficult to extract any precise numbers from a political discussion. :) However, if you click through some of the links in the article, or have a look at the followup from today, you’ll find McArdle quoting predictions of tens of thousands of preventable deaths yearly from non-insured status. That looks to me like a pretty big hazard rate, no?
No. The Oracle says there’re about 50 million Americans without health insurance. The predictions you quoted refer to 18,000 or 27,000 deaths for want of insurance per year. The higher number implies only a 0.054% death rate per year, or a 3.5% death rate over 65 years (Americans over 65 automatically get insurance). This is non-negligible but hardly huge (and potentially important for all that).
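The arithmetic behind those two percentages, as a small sketch. The inputs are the figures quoted above; treating the excess risk as constant and independent from year to year across all 65 years is a simplifying assumption of mine, not something the predictions claim.

```python
# Figures quoted in the comment above.
uninsured = 50_000_000
deaths_per_year_high = 27_000
deaths_per_year_low = 18_000

# Annual excess risk of death per uninsured person.
annual_risk_high = deaths_per_year_high / uninsured
annual_risk_low = deaths_per_year_low / uninsured

# Cumulative risk over 65 years, assuming the same independent risk each year.
years = 65
cumulative_high = 1 - (1 - annual_risk_high) ** years
cumulative_low = 1 - (1 - annual_risk_low) ** years

print(f"Annual excess risk: {annual_risk_low:.3%} to {annual_risk_high:.3%}")
print(f"Cumulative over {years} years: {cumulative_low:.2%} to {cumulative_high:.2%}")
```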
Edit: and I see gwern has whupped me here.
Eyeballing the statistics, that looks like a hazard ratio between 1.1 and 1.5 (lots of things are good predictors for mortality that you would want to control for that I haven’t; the more you add, the closer that number should get to 1.1).
It looks like you’re referring to a hazard ratio or maybe a relative risk, neither of which is the same as a “hazard rate” AFAIK.
You’re right; I’m thinking of hazard ratios. Editing.
Over a population of something like 50 million people? Dunno.