Even if you say that science isn’t about solving real-world issues but about knowledge, I still think that a replication rate of 11% in the case of breakthrough cancer research indicates that the field is not good at finding out what’s going on.
I don’t think a flat replication rate of 11% tells us anything without recourse to additional considerations. It’s sort of like a Umeshism: if your experiments are not routinely failing, you aren’t really experimenting. The best we can say is that 0% and 100% are both suboptimal...
For example, if I were told that anti-aging research had an 11% replication rate for its ‘stopping aging’ treatments, I would regard this as shockingly too high and a collective crime on par with the Nazis, and if anyone asked me, I would tell them that we need to spend far, far more on anti-aging research, because we clearly are not trying nearly enough crazy ideas. And if someone told me the clinical trials for curing balding were replicating at 89%, I would be a little uneasy and wonder what side-effects we were exposing all these people to.
(Heck, you can’t even tell much about the quality of the research from just a flat replication rate. If the prior odds are 1 in 10,000, then 11% looks pretty damn good. If the prior odds are 1 in 5, pretty damn bad.)
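To make that concrete, here is a minimal sketch of the arithmetic, assuming (purely illustratively) α = 0.05 and 80% power for both the original studies and the replications, and counting a result as ‘replicated’ if the replication is also significant:

```python
# Expected replication rate as a function of the prior odds that a tested
# hypothesis is true. All parameter values are illustrative assumptions.
alpha, power = 0.05, 0.80   # assumed significance cutoff and study power

def expected_replication_rate(prior_odds):
    p_true = prior_odds / (1 + prior_odds)      # P(hypothesis is true)
    pub_true = p_true * power                   # true positives published
    pub_false = (1 - p_true) * alpha            # false positives published
    ppv = pub_true / (pub_true + pub_false)     # positive predictive value
    # true effects replicate at `power`, spurious ones at `alpha`
    return ppv * power + (1 - ppv) * alpha

for odds in (1 / 10_000, 1 / 5):
    print(f"prior odds {odds:g}: expect ~{expected_replication_rate(odds):.0%} to replicate")
# prior odds 0.0001: expect ~5% to replicate  -> an observed 11% is doing well
# prior odds 0.2:    expect ~62% to replicate -> an observed 11% is doing badly
```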
What I would accept as a useful invocation of an 11% rate is, say, an economic analysis of the benefits showing that this represents over-investment (for example, falling pharmacorp share prices), or surprise on the part of planners/scientists/CEOs/bureaucrats who had held more optimistic assumptions (and so investment is likely being wasted). That sort of thing.
The replication rate of experiments is quite different from their success rate.
An 11% success rate is often shockingly high. An 11% replication rate means the researchers are sloppy, value publishing over confidence in the results, and likely spend way too much time throwing spaghetti at the wall...
Even granting your distinction, the exact same argument still applies: just substitute in an additional rate of, say, a 10% chance of going from replication to whatever you choose to define as ‘success’. You cannot say that an 11% replication rate and then a 1.1% success rate is optimal—or suboptimal—without doing more intellectual work!
No, I don’t think so. An 11% replication rate means that 89% of the published results are junk and external observers have no problems seeing that. Which implies that if those who published it were a bit more honest/critical/responsible, they should have been able to do a better job of controlling for the effects which led them to think there’s statistical significance when in fact there’s none.
If the prior odds are 1:10,000, you have no business publishing results at a 0.05 confidence level.
An 11% replication rate means that 89% of the published results are junk and external observers have no problems seeing that.
Yes, so? As Edison said, I have discovered 999 ways to not build a lightbulb.
Which implies that if those who published it were a bit more honest/critical/responsible, they should have been able to do a better job of controlling for the effects which led them to think there’s statistical significance when in fact there’s none.
Huh? No. As I already said, you cannot go from replication rate to judgment of the honesty, competency, or insight of researchers without additional information. Most obviously, it’s going to be massively influenced by the prior odds of the hypotheses.
If the prior odds are 1:10,000, you have no business publishing results at a 0.05 confidence level.
No one has any business publishing at an arbitrary confidence level, which should be chosen with respect to some even half-assed decision analysis. 1:10,000 or 1:1,000, it doesn’t matter.
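The sort of half-assed decision analysis meant here might look like the sketch below, where every number (sample size, effect size, prior probability, costs) is a hypothetical stand-in of my own:

```python
# Choose the significance cutoff that minimizes expected loss for a one-sided
# z-test, rather than defaulting to 0.05. Every number here is hypothetical.
import numpy as np
from scipy.stats import norm

n, effect = 100, 0.3          # assumed sample size and standardized effect
p_true = 1 / 6                # assumed prior probability the effect is real
cost_fp, cost_fn = 10.0, 1.0  # assume false alarms cost 10x more than misses

def expected_loss(alpha):
    power = norm.sf(norm.isf(alpha) - effect * np.sqrt(n))
    return (1 - p_true) * alpha * cost_fp + p_true * (1 - power) * cost_fn

alphas = np.logspace(-5, -0.5, 2000)
best = alphas[np.argmin([expected_loss(a) for a in alphas])]
print(f"loss-minimizing alpha ≈ {best:.4f}")  # ≈ 0.0025 with these made-up numbers, not 0.05
```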
As Edison said, I have discovered 999 ways to not build a lightbulb.
You’re still ignoring the difference between a failed experiment and a failed replication.
Edison did not publish 999 papers each of them claiming that this is the way to build the lightbulb (at p=0.05).
you cannot go from replication rate to judgment of the honesty, competency, or insight of researchers without additional information. Most obviously, it’s going to be massively influenced by the prior odds of the hypotheses.
And what exactly prevents the researchers from considering the prior odds when they are trying to figure out whether their results are really statistically significant?
I disagree with you—if a researcher consistently publishes research that cannot be replicated, I will call him a bad researcher.
You’re still ignoring the difference between a failed experiment and a failed replication. Edison did not publish 999 papers each of them claiming that this is the way to build the lightbulb (at p=0.05).
So? What does this have to do with my point about optimizing return from experimentation?
And what exactly prevents the researchers from considering the prior odds when they are trying to figure out whether their results are really statistically significant?
Nothing. But no one does that because to point out that a normal experiment has resulted in a posterior probability of <5% is not helpful since that could be said of all experiments, and to run a single experiment so high-powered that it could single-handedly overcome the prior probability is ludicrously wasteful. You don’t run a $50m clinical trial enrolling 50,000 people just because some drug looks interesting.
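To put a rough number on that: the Sellke–Bayarri upper bound on the Bayes factor obtainable from a p-value shows that even the most charitable reading of a lone p = 0.05 barely moves a 1:10,000 prior:

```python
# Best-case posterior from a single p = 0.05 result against 1:10,000 prior
# odds, via the Sellke-Bayarri bound BF <= 1 / (-e * p * ln p) for p < 1/e.
import math

p = 0.05
prior_odds = 1 / 10_000
bf_max = 1 / (-math.e * p * math.log(p))          # ≈ 2.46 at p = 0.05
posterior_odds = prior_odds * bf_max
print(f"posterior probability <= {posterior_odds / (1 + posterior_odds):.4%}")
# posterior probability <= 0.0246% -- nowhere near even 5%
```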
I disagree with you—if a researcher consistently publishes research that cannot be replicated, I will call him a bad researcher.
Too bad. You should get over that.
I think our disagreement comes (at least partially) from different views on what publishing research means.
I see your position as looking at publishing as something like “We did A, B, and C. We got the results X and Y. Take it for what it is. The end.”
I’m looking at publishing more like this: “We did multiple experiments which did not give us the magical 0.05 number, so we won’t tell you about them. But hey, try #39 succeeded and we can publish it: we did A39, B39, and C39 and got the results X39 and Y39. The results are significant, so we believe them to be meaningful and reflective of actual reality. Please give our drug to your patients.”
The realities of scientific publishing are unfortunate (and yes, I know of efforts to ameliorate the problem in medical research). If people published all their research (“We did 50 runs with the following parameters, all failed, sure #39 showed statistical significance but we don’t believe it”) I would have zero problems with it. But that’s not how the world currently works.
P.S. By the way, here is some research which failed replication (via this)
The realities of scientific publishing are unfortunate (and yes, I know of efforts to ameliorate the problem in medical research). If people published all their research (“We did 50 runs with the following parameters, all failed, sure #39 showed statistical significance but we don’t believe it”) I would have zero problems with it. But that’s not how the world currently works.
That would be a better world. But in this world, it would still be true that there is no universal, absolute, optimal percentage of experiments failing to replicate, and the optimal percentage is set by decision-theoretic/economic concerns.
Experiments that fail to replicate at percentages greater than those expected from published confidence values (say, posterior probabilities) are evidence that the published confidence values are wrong.
A research process that consistently produces wrong confidence values has serious problems.
Experiments that fail to replicate at percentages greater than those expected from published confidence values (say, posterior probabilities) are evidence that the published confidence values are wrong.
How would you know? People do not produce posterior probabilities or credible intervals; they produce confidence intervals and p-values.
I don’t see how this point helps you.
Either the p-values in the papers are worthless in the sense of not reflecting the probability that the observed effect is real—in which case the issue in the parent post stands.
Or the p-values, while not perfect, do reflect the probability the effect is real—in which case they are falsified by the replication rates and in which case the issue in the parent post stands.
Either the p-values in the papers are worthless in the sense of not reflecting the probability that the observed effect is real
p-values do not reflect the probability that the observed effect is real but the inverse conditional (the probability of the observed data assuming there is no real effect), and no one has ever claimed that they do, so we can safely dismiss this entire line of thought.
Or the p-values, while not perfect, do reflect the probability the effect is real
p-values can, with some assumptions and choices, be used to calculate other things like positive predictive value/PPV, which are more meaningful. However, the issue still stands. Suppose a field’s studies have a PPV of 20%. Is this good or bad? I don’t know—it depends on the uses you intend to put it to and the loss function on the results.
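As a toy illustration (the payoff numbers are invented, not from anything above): whether acting on a finding with a given PPV is worthwhile reduces to whether the expected gain outweighs the expected cost:

```python
# Acting on a positive finding is worthwhile iff ppv * gain > (1 - ppv) * cost,
# i.e. iff ppv > cost / (gain + cost). The payoff numbers are hypothetical.
def worth_acting(ppv, gain_if_true, cost_if_false):
    return ppv * gain_if_true > (1 - ppv) * cost_if_false

print(worth_acting(0.20, gain_if_true=100, cost_if_false=5))  # True: cheap bet, big upside
print(worth_acting(0.20, gain_if_true=2, cost_if_false=50))   # False: costly mistake
```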
Maybe it would be helpful if I put it in Bayesian terms, where the quantities are more meaningful & easier to understand. Suppose an experiment turns in a posterior with 80% of the distribution >0. Subsequent experiments or additional data collection will agree with and ‘replicate’ this result the obvious amount.
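What ‘the obvious amount’ works out to can be simulated; in the sketch below I assume a normal posterior with unit standard error, a replication run at the same precision as the original, and ‘replication’ defined as a one-sided p < 0.05:

```python
# How often does a perfectly honest result with posterior P(effect > 0) = 0.80
# "replicate"? Assumptions (mine): normal unit-variance posterior, replication
# with the same standard error, success = one-sided p < 0.05.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
post_mean = norm.ppf(0.80)                     # so that P(effect > 0) = 0.80
effects = rng.normal(post_mean, 1, 100_000)    # draws from the posterior
replications = rng.normal(effects, 1)          # each replication's estimate
print((replications > norm.isf(0.05)).mean())  # ≈ 0.29 reach significance again
print((replications > 0).mean())               # ≈ 0.72 merely agree in sign
```

So under these assumptions a genuine 80%-confident finding ‘fails to replicate’ at p < 0.05 roughly seven times out of ten, with no misconduct anywhere.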
Now, was this experiment ‘underpowered’ (it collected too little data and is bad) or ‘overpowered’ (too much and inefficient/unethical) or just right? Was this field too tolerant of shoddy research practices in producing that result?
Well, if the associated loss function has a high penalty on true values being <0 (because the cancer drugs have nasty side-effects and are expensive and only somewhat improve on the other drugs), then it was probably underpowered; if the loss function imposes only a small penalty (because it was a website A/B test and you lose little if it was a worse variant), then it was probably overpowered, because you spent more traffic/samples than you needed to choose a variant.
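A toy comparison of those two cases (the penalty numbers are invented): the same 80% posterior licenses acting in one regime and demands more data in the other:

```python
# Expected loss of acting on P(effect > 0) = 0.80 under two loss functions.
p_positive = 0.80

def loss_of_acting(penalty_if_harmful, gain_if_helpful=1.0):
    # pay the penalty when the true effect is negative, forgo nothing and
    # collect the gain when it is positive
    return (1 - p_positive) * penalty_if_harmful - p_positive * gain_if_helpful

print(loss_of_acting(penalty_if_harmful=20))   # +3.2: don't act -> underpowered
print(loss_of_acting(penalty_if_harmful=0.1))  # -0.78: act -> the data was plenty
```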
The ‘replication crises’ are a ‘crisis’ in part because people are basing meaningful decisions on the results to an extent that cannot be justified if one were to explicitly go through a Bayesian & decision-theory analysis with informative data. E.g., pharmacorps probably should not be spending millions of dollars to buy and do preliminary trials on research which is not much distinguishable from noise, as they have learned to their intense frustration & financial cost, to say nothing of diet research. If the results did not matter to anyone, then it would not be a big deal if the PPV were 5% rather than 50%: the researchers would cope, and other people would not make costly suboptimal decisions.
There is no single replication rate which is ideal for cancer trials and GWASes and individual differences psychology research and taxonomy and ecology and schizophrenia trials and...