Sentence by Sentence: “Why Most Published Research Findings are False”
Preparing for a career in biomedical research, I thought it prudent to thoroughly read the leading expositor of profound skepticism toward my intended domain. I’m an undergraduate student with only a very basic understanding of statistics and zero professional experience in scientific research. This is my sentence-by-sentence reaction/live-blog to reading Ioannidis’s most famous paper.
I’ve put his original headings in bold, and quotes from his paper are indented. The one part I didn’t respond to is the opening paragraph. I got the last word, so I’ll give John the first word:
Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies to the most modern molecular research. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.
Modeling the Framework for False Positive Findings
Several methodologists have pointed out that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05.
Why do replication studies so often fail? One possibility is that scientists are good at predicting which findings are incorrect and targeting them for a repeat trial. Another is that there are a lot of false positives out there, which is the issue Ioannidis is grappling with.
However, note that he’s not claiming that most of the research you’re citing in your own work is most likely false. Science may have informal methods to separate the wheat from the chaff post-publication: journal prestige, referrals, mechanistic plausibility, durability, and perhaps others.
Instead, he’s saying that if you could gather up every single research paper published last year, write each finding on a separate notecard, and draw them randomly out of a hat, most of the claims you drew would be false.
Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values.
How widespread is this notion, and who are the most common offenders? Medical administrators? Clinicians? Researchers? To what extent is this a problem in the popular press, research articles, and textbooks? How is this notion represented in speech and the written word?
Where should students have their guard up against their own teachers and curriculums, and when can they let down their guard?
Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations… However, here we will target relationships that investigators claim exist, rather than null findings.
Hence, a study that finds p >= 0.05 is not considered a research finding by this definition. In this paper, Ioannidis is making claims only about the rate of false positives, not about the rate of false negatives.
“Negative” research is also very useful. “Negative” is actually a misnomer, and the misinterpretation is widespread.
In my undergraduate research class, students came up with their own research ideas. We were required to do a mini-study using the ion chromatographer. One of our group members was planning to become a dietician, and he’d heard that nitrites are linked to stomach cancer. We were curious about whether organic and non-organic apples contained different levels of nitrites. So we bought some of each, blended them up, extracted the juice, and ran several samples through the IC.
We found no statistically significant difference in the level of nitrites between the organic and non-organic apples, and a very low absolute level of nitrites. But there were high levels of sulfates. Our instructor considered this negative result a problem, a failure of the research. Rather than advising us to use the null result as a useful piece of information, she suggested that we figure out a reason why high levels of sulfates might be a problem, and use that as our research finding.
Well, drinking huge amounts of sulfates seems to cause some health effects, though nothing as exciting as stomach cancer. Overall, though, “The existing data do not identify a level of sulfate in drinking-water that is likely to cause adverse human health effects.” I’m sure we managed to find some way to exaggerate the health risks of sulfates in apples in order to get through the assignment. But the experience left a bad taste in my mouth.
My first experience in a research class was not only being told that a null finding was a failure, but that my response to it should be to dredge the data and exaggerate its importance in order to produce something that looked superficially compelling. So that’s one way this “misinterpretation” can look in real life.
As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11].
“The prior probability of it being true before doing the study? How can anybody know that? Isn’t that the reason we did the study in the first place?”
Don’t worry, John’s going to explain!
Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field.
On this 2x2 table, we have four possibilities: two types of success, and two types of mistakes. The successes are when science discovers a real relationship, or disproves a fake one. The failures are when it mistakenly “discovers” a fake relationship, a false positive or type I error, or “disproves” a real one, a false negative or type II error. False positives, type I errors, are the type of problem Ioannidis is dealing with here.
In a research field both true and false hypotheses can be made about the presence of relationships.
Scientists have some educated guesses about what as-yet-untested hypotheses about relationships are reasonable enough to be worth a study. Is hair color linked to IQ? Is wake-up time linked to income? Is serotonin linked to depression? Sometimes they’ll be right, other times they’ll be wrong.
Let R be the ratio of the number of “true relationships” to “no relationships” among those tested in the field.
So let’s pretend we knew how often scientists were right in their educated guess about the existence of a relationship. R isn’t the ratio of findings where p < 0.05 to findings where p >= 0.05, that is, of non-null findings to null findings. Instead, it’s a measure of how often scientists are actually correct when they do a study to test whether a relationship exists: the number of tested relationships that are real for every one that is not.
R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated.
If scientists are making educated guesses to decide what to study, then R is a measure of just how educated their guesses really are. If three of their hypotheses are actually true for every two that are actually false, then R = 3 / 2 = 1.5. If they make 100 actually false conjectures for every one that is actually true, then R = 1 / 100 = 0.01.
Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships.
Let’s imagine a psychologist is studying the relationship of hair color and personality type. Ioannidis is saying that we’re only considering one of two cases:
Hair color is linked to one, and only one, aspect of personality—but we don’t know which.
Hair color is linked at about the same level of strength to several aspects of personality. So for example, let’s pretend that redheads tend to be both a little more extroverted and neurotic than average, but not too much. If it’s much more linked to extroversion than neuroticism—i.e. we’d need to study 100 people to discover the link with extroversion, but 10,000 people to discover the link with neuroticism—then Ioannidis is not considering these two links as being in the same category. If instead, we’d need to study the same number of subjects to discover both relationships, then this is an example of what he’s talking about.
The pre-study probability of a relationship being true is R/(R + 1).
Using our examples above, if our scientists make 3 actually true conjectures for every 2 actually false conjectures, then R = 1.5 and they have a 1.5/(1.5 + 1) = 0.6 = 60% chance of any given conjecture that a relationship exists being true. Following the hair color example above, if our scientists have R = 1.5 for their hypotheses about links between hair color and personality, then every time they run a test there is a 60% chance (3 in 5) that the relationship they’re trying to detect actually exists.
That doesn’t mean there’s a 60% chance that they’ll find it. That doesn’t mean it’s a particularly strong relationship. And of course, they have no real way of knowing that their R value is 1.5, because it’s a measure of how often their guesses are actually true, not how often they replicate or how plausible they seem. R is not directly measurable.
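To check the arithmetic above, here is a minimal Python sketch that turns the odds R into a pre-study probability. The function name is made up for illustration; only the formula R/(R + 1) comes from the paper.

```python
def prestudy_probability(R: float) -> float:
    """Convert pre-study odds R (true : no relationship) into a probability."""
    if R < 0:
        raise ValueError("R is a ratio of counts and cannot be negative")
    return R / (R + 1)


# 3 actually true conjectures for every 2 actually false ones
print(prestudy_probability(1.5))
# 1 actually true conjecture for every 100 actually false ones
print(prestudy_probability(0.01))
```

Note that for very small R the probability and the odds are nearly the same number, which is why the two get used almost interchangeably for long-shot fields.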
The probability of a study finding a true relationship reflects the power 1 - β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α.
These are just more definitions. The power is the complement of the Type II error rate. If an actually true relationship exists and our study has an 80% chance of detecting it, then it also has a 20% (.2) Type II error rate—the chance of failing to detect it. If no actual relationship exists, but our study has a 5% chance of detecting one anyway, our Type I error rate (α) is .05.
Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1.
If we magically could know the values of R, the Type I and Type II error rates, and the total number of research findings in a given field, we could determine the exact number of research findings that were correctly demonstrated to be actually true or actually false, and the number that were mistakenly found to be true or false.
Of course, we don’t know those numbers.
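Still, if we pretend we do know them, the four cells of the table follow directly from the definitions quoted so far. The sketch below is my reading of those definitions rather than a copy of Table 1, and the function and key names are invented for illustration.

```python
def expected_table(c: float, R: float, alpha: float, beta: float) -> dict:
    """Expected counts in the 2x2 table for c probed relationships.

    R     : odds of true to no relationships among those tested
    alpha : Type I error rate
    beta  : Type II error rate (power is 1 - beta)
    """
    truly_related = c * R / (R + 1)   # relationships that really exist
    truly_unrelated = c / (R + 1)     # relationships that do not
    return {
        "true_positive": (1 - beta) * truly_related,
        "false_negative": beta * truly_related,
        "false_positive": alpha * truly_unrelated,
        "true_negative": (1 - alpha) * truly_unrelated,
    }


# Made-up numbers: 1,000 probed relationships, R = 1.5, alpha = 0.05, beta = 0.1
for cell, count in expected_table(1000, 1.5, 0.05, 0.1).items():
    print(cell, round(count, 1))
```

With these invented inputs, 600 of the 1,000 relationships are real and 400 are not, so the four cells come out to 540, 60, 20, and 380.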
After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV.
Let’s say that we run 100 studies, and 40 of the claims achieve formal statistical significance. That doesn’t mean all 40 of them are actually true; some might be Type I errors. So let’s say that 10 of the 40 formally statistically significant findings are actually true. That means that PPV is 10⁄40, or 25%. It means that when we achieve a statistically significant finding, it only has a 25% chance of being actually true.
Of course, that’s just a made up number for illustrative purposes. It could be 99%, or 1% - who knows?
The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability.
So the “false positive report probability” is just 1 - PPV. In the above example with a PPV of 25%, the FPRP would be 75%.
According to the 2 × 2 table, one gets PPV = (1 - β)R/(R - βR + α).
So we have a formula to calculate the PPV based on R and the chances of Type I and Type II errors.
A research finding is thus more likely true than false if (1 - β)R > α.
So for example, imagine that our hair color/personality psychologist has a rate of 3 actually true conjectures to 2 actually false conjectures whenever he puts a hypothesis to the test. His R is 3⁄2 = 1.5.
If he has a 10% chance of missing an actually true relationship, and a 5% chance of finding an actually false relationship, then β = 0.1 and α = 0.05. In this case, (1 − 0.1) * 1.5 = 1.35, which is greater than α, meaning that his research findings are more likely true than false.
On the other hand, imagine that our psychologist has a pretty poor idea of how hair color and personality link up, so every time he conjectures a relationship and puts it to the test, only 1 relationship is real for every 100 he tests. His R value is 1⁄100 = .01. Using the same values for β and α, our formula is (1 - .1) * .01 = .009, which is less than α, so most of his findings that achieve statistical significance will actually be false.
In general, the more educated his guesses are, the more sensitive and specific his tests are for the effect he’s examining, the more likely any relationships he finds will be real.
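The threshold test only tells us which side of 50% we land on. Plugging the same made-up numbers into the PPV formula quoted above gives the actual values. This is a small sketch; the function names are mine.

```python
def ppv(R: float, alpha: float = 0.05, beta: float = 0.1) -> float:
    """Positive predictive value: PPV = (1 - beta) * R / (R - beta * R + alpha)."""
    return (1 - beta) * R / (R - beta * R + alpha)


def more_likely_true_than_false(R: float, alpha: float = 0.05, beta: float = 0.1) -> bool:
    """The paper's shortcut: PPV > 0.5 exactly when (1 - beta) * R > alpha."""
    return (1 - beta) * R > alpha


for R in (1.5, 0.01):
    print(R, round(ppv(R), 4), more_likely_true_than_false(R))
```

For the well-informed psychologist (R = 1.5) the PPV is about 96%. For the one guessing at 1-in-100 odds (R = 0.01) it is about 15%, even though both use exactly the same tests.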
Since usually the vast majority of investigators depend on a = 0.05, this means that a research finding is more likely true than false if (1 - β)R > 0.05.
By a, I believe Ioannidis means α. He’s just filling in the commonly-used threshold of “statistical significance.”
What is less well appreciated is that bias and the extent of repeated independent testing by different teams of investigators around the globe may further distort this picture and may lead to even smaller probabilities of the research findings being indeed true.
So far, Ioannidis has just been offering an equation to model the rate of false positive findings, if we knew the values of R, α, and β and had access to every single experiment our psychologist ever did. Now he’s pointing out that we don’t have access to every experiment by our psychologist, but rather a biased sample—just the data he was able to get published.
And furthermore, there might be some other hair color/personality psychologists studying the same questions independently. If enough of them look for a relationship that doesn’t exist, say between hair color and conscientiousness, then one of them will eventually find a link just due to random sampling error. For example, they’ll happen to get a sample of particularly conscientious blondes due to random chance and publish the finding, even though other psychologists studying the same relationship didn’t find a link. And the link is indeed not real—this team just happened to usher some particularly hard working blondes into their lab for no other reason than coincidence.
Now, although publication bias and repeated independent testing are real phenomena, Ioannidis so far has made no claims about how common they are. He’s just identifying that they probably have some influence in inflating our false positive rate—or, in other words, decreasing our PPV, the proportion of statistically significant findings that are actually true.
We will try to model these two factors in the context of similar 2 × 2 tables.
Just like Ioannidis was able to give us equations to calculate our statistically significant + actually true, false positive, false negative, and statistically insignificant + actually false rates, he’s going to give us equations that can take into account the level of bias and the effect of independent repeat testing on the PPV.
First, let us define bias as the combination of various design, data, analysis, and presentation factors that tend to produce research findings when they should not be produced.
What are some examples?
Maybe a psychologist eyeballs whether a test subject has red or blond hair, and determines whether they’re conscientious by asking them to wash some dishes in the psych lab kitchen and deciding whether the dishes were clean enough. If he thinks that blondes are conscientious, he might inspect the plates before deciding whether the subject who washed them was a strawberry blonde or a redhead.
Alternatively, he might dredge the data, or choose a statistical analysis that is more likely to give a statistically significant result in a borderline case.
Let u be the proportion of probed analyses that would not have been “research findings,” but nevertheless end up presented and reported as such, because of bias.
So if u = 0.1, then 10% of the studies that should have produced null results have been twisted and distorted into statistically significant findings. Again, we have no idea what u is (so far at least); it’s just a formalization of this concept.
Bias should not be confused with chance variability that causes some findings to be false by chance even though the study design, data, analysis, and presentation are perfect.
If our psychology researcher lets in 100 subjects to his study, and happens by coincidence to get some particularly conscientious blondes, that’s not bias. We just call it chance.
Bias can entail manipulation in the analysis or reporting of findings.
This could mean that our researcher has 5 different options for which statistical test to use on his data. Some are more strict, so that it looks impressive if the data is still significant. Others are more lax, but can be reasonable choices under some circumstances. Our researcher might try all 5, and pick the most strict-looking test that still produces a significant result. That’s a form of bias.
Selective or distorted reporting is a typical form of such bias.
Another is if he doesn’t have any better research ideas than the hair color/personality link. So if his first study produces a null result, he locks the data in his file drawer and runs the study again. He repeats this until he happens to get some particularly hard-working blondes, and then publishes just that data as his finding. Of course, if he does this, he’s not finding a real relationship. It’s like somebody who films himself flipping a quarter thousands of times until he gets 6 heads in a row, and then posts the clip on YouTube and claims he’s mastered the art of flipping a coin and getting heads every time.
We may assume that u does not depend on whether a true relationship exists or not.
Ioannidis is assuming that it doesn’t matter whether or not there really is a link between hair color and personality—our researcher will still behave in this biased manner either way. This is an a priori, intuitive assumption that he is making about the behavior of researchers. Why is it OK for him to make this assumption?
This is not an unreasonable assumption, since typically it is impossible to know which relationships are indeed true.
Remember, u is the proportion of studies that shouldn’t find a relationship, but do anyway, specifically due to bias.
Imagine the hair color/dish washing study used an automated hair-color-o-meter and dish-inspector, which would automatically judge both the hair color of the subjects and the cleanliness of the dishes. Furthermore, the scientist pre-registers the study and analysis plan in advance. Everything is completely roboticized—he’s not even physically present at the lab to influence how things proceed, and even the contents of the paper are pre-written. The data flows straight from the computers that measure it to another program that applies the pre-registered statistical analysis formula, spits out the result, and then auto-generates the text of the paper. All bias has been eliminated from the study, meaning that u has dropped to 0.
Note that there is still a chance of a false positive finding, even though u = 0.
Now, Ioannidis is saying that it’s reasonable to assume that the psychologist’s decision of whether or not to roboticize his studies and eliminate bias has nothing to do with whether or not there really is a link between hair color and personality.
Does that seem reasonable to you? It’s an empirical claim, and one offered with no supporting evidence. You’re allowed to have your own opinion.
Here’s an argument against Ioannidis’s assumption:
Maybe researchers end up in labs that are specialized to study a certain relationship. If that relationship actually exists, then they develop a culture of integrity, because they have success in generating significant findings using honest research practices.
On the other hand, if there is no real relationship, they’re all too invested, emotionally and materially, in the topic to admit that the relationship is false. So they develop a culture of corruption. They fudge little things and big things, justifying it to themselves and others, until they’re able to pretty reliably crank out statistically significant, but false findings. They keep indoctrinating new grad students and filter out the people who can’t stomach the bad behavior.
If this “corruption model” is true, then there is a relationship between the existence of a true relationship and u, the proportion of studies that lead to publications specifically due to bias.
Let’s bear in mind that some of Ioannidis’s further claims might hinge on this assumption he’s making without any evidence.
In the presence of bias (Table 2), one gets PPV = ([1 - β]R + uβR)/(R + α − βR + u − uα + uβR), and PPV decreases with increasing u, unless 1 − β ≤ α, i.e., 1 − β ≤ 0.05 for most situations.
If the extent of bias has no relationship with the existence of a real relationship, then this equation lets us model the chance that a published finding is true given all the other mystery variables we’ve discussed earlier. As u—the extent of bias—increases, the chance that any given published finding is actually true decreases.
Any correlation between the extent of bias and existence of a real relationship will introduce error into this equation.
Thus, with increasing bias, the chances that a research finding is true diminish considerably.
That is, provided there is no strong link between bias and actual truth or falsehood.
This is shown for different levels of power and for different pre-study odds in Figure 1.
We can plug in fake values for all these different variables and see what the value we really care about, the PPV, will be.
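Here is what that plugging-in might look like as a short Python sketch. The formula is the one quoted above for the biased case; the function name and the particular values of R, α, β, and u are made up for illustration.

```python
def ppv_with_bias(R: float, alpha: float, beta: float, u: float) -> float:
    """PPV in the presence of bias u, using the formula quoted from the paper."""
    numerator = (1 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


# Made-up inputs: a well-informed field (R = 1.5) with well-powered studies
for u in (0.0, 0.1, 0.3, 0.5):
    print(u, round(ppv_with_bias(R=1.5, alpha=0.05, beta=0.1, u=u), 3))
```

Setting u = 0 recovers the bias-free formula, and with these inputs each increase in u pulls the PPV down, which matches the claim that bias erodes the chance a finding is true.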
Conversely, true research findings may occasionally be annulled because of reverse bias. For example, with large measurement errors relationships are lost in noise, or investigators use data inefficiently or fail to notice statistically significant relationships, or there may be conflicts of interest that tend to “bury” significant findings.
So maybe our hair psychologist has no good test for conscientiousness or hair color, and therefore isn’t able to find an effect that is actually there. Maybe he does have the data to detect a real effect, but dies of a heart attack and it never gets published. Maybe the Hair Equity Association threatens to end the career of any scientist publishing findings that hair color and personality are linked.
There is no good large-scale empirical evidence on how frequently such reverse bias may occur across diverse research fields.
Guess we’ll have to make some more assumptions, then!
However, it is probably fair to say that reverse bias is not as common.
Why? Guess we’ll just have to take Prof. Ioannidis’s word for it. Man, honestly, at least with the last assumption about the lack of a link between bias and the existence of a real effect, he offered a reason why.
Moreover measurement errors and inefficient use of data are probably becoming less frequent problems, since measurement error has decreased with technological advances in the molecular era and investigators are becoming increasingly sophisticated about their data.
Let’s just note that this is an assumption about the frequency of measurement error and inefficient data use, which is in turn based on an assumption about the link between technological advances and measurement error, and an assumption about the change in researchers’ sophistication...
It’s assumptions all the way down, I guess.
Can we question them?
Maybe as technology proceeds, we’re able to try and detect subtler and more complex effects using our new tools. We push the boundaries of our methods. We fail to fully exploit the gigantic amounts of data that are available to us. And I just have no prior expectation that people are any more “sophisticated” now than they used to be. What does that even mean?
Regardless, reverse bias may be modeled in the same way as bias above.
So we can make up some numbers and visualize the combinations in a graph.
Also reverse bias should not be confused with chance variability that may lead to missing a true relationship because of chance.
Let’s say that blondes really do clean the dishes better than people with other hair colors. But on the day of our study, the subjects all just so happen to work equally hard at washing the dishes, so there’s a null finding. Ioannidis is reminding us that this isn’t bias—just the effect of chance.
Testing by Several Independent Teams
Several independent teams may be addressing the same sets of research questions.
So there might be a hair color/personality lab in Beijing and another in New York City, both running their own versions of the dish-washing study.
As research efforts are globalized, it is practically the rule that several research teams, often dozens of them, may probe the same or similar questions.
Another uncited empirical claim, but you know best, John! And honestly, this does seem plausible to me.
Unfortunately, in some areas, the prevailing mentality until now has been to focus on isolated discoveries by single teams and interpret research experiments in isolation.
So Ioannidis is saying that some scientists have a habit of focusing on just the output of the Beijing lab, or just the New York City lab, but not looking at the output of both labs as they probe the link between hair color and personality. So if the Beijing lab finds a link, but the New York City lab doesn’t, then the people following the Beijing lab will have an inflated opinion of the overall, global evidence of the link (and the people following the New York City data will have the opposite problem).
An increasing number of questions have at least one study claiming a research finding, and this receives unilateral attention.
Try giving a TED talk on hair color and personality based around a lack of a relationship. Hard to do. So if you’re trying to give that talk, you’re going to exclusively talk about the Beijing findings, and completely ignore the contradictory data out of NYC. I’m sure you can think of other examples where this sort of thing goes on. Ioannidis is trying to tell us that we have a habit of ignoring, or just failing to seek out, contradictory or inconvenient data. We can make stories more compelling by ignoring context, or by gathering supporting evidence into a giant mass, then using it to dismiss each contradictory finding as it pops up.
The probability that at least one study, among several done on the same question, claims a statistically significant research finding is easy to estimate.
Imagine we knew the values of all our “mystery variables” (R, β, and α, not considering bias), but hadn’t run any studies yet. Ioannidis has another equation to tell us the chance that at least one study would turn up a significant finding if we did run a given number of studies.
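A minimal sketch of that calculation, assuming the n studies are independent, share the same α and β, and are free of bias. The function and parameter names are my own, and the formula is what follows from independence rather than something copied from the paper.

```python
def prob_at_least_one_significant(
    n: int, alpha: float = 0.05, beta: float = 0.2, relationship_is_real: bool = False
) -> float:
    """Chance that at least one of n independent studies reaches significance."""
    if relationship_is_real:
        # Every study would have to miss the effect: probability beta ** n
        return 1 - beta ** n
    # Every study would have to avoid a false positive: probability (1 - alpha) ** n
    return 1 - (1 - alpha) ** n


# No real link between hair color and conscientiousness, alpha = 0.05
for n in (1, 5, 10, 20):
    print(n, round(prob_at_least_one_significant(n), 3))
```

With no real relationship and α = 0.05, ten independent labs already have about a 40% chance that at least one of them reports a "finding."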
With increasing number of independent studies, PPV tends to decrease, unless 1 - β < a, i.e., typically 1 − β < 0.05.
1 − β is the power of the study, the chance of not getting a false negative (Type II error). So let’s say we have α = .05, meaning that we require a 5% or lower chance of a false positive to consider a study “statistically significant.” In that case, unless our studies are so underpowered that they detect a real effect less than 5% of the time, more independent studies on the same question will tend to decrease the PPV.
You’d typically think that more high-powered studies would be a good thing. Why run one study on the link between hair color and conscientiousness when you could run ten such studies?
Well, let’s say there really isn’t a link. That means that a null result is getting at the truth.
You run one study, and find no result. Now you run another one, and again, no result. So far, you have a perfect track record. If you run another 100 studies, though, you might find a relationship—even though none exists—which will make your track record worse. Doing more testing actually decreased your accuracy.
This is shown for different levels of power and for different pre-study odds in Figure 2. For n studies of different power, the term β^n is replaced by the product of the terms β_i for i = 1 to n, but inferences are similar.
Another “plug in the mystery numbers and see what the graph looks like” figure.
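To see the shape of that graph without the figure, here is a rough Python sketch. It assumes that a "finding" means at least one of n independent, equally powered, unbiased studies reached significance, and it derives the PPV from that assumption rather than quoting the paper's table. The names and input values are invented.

```python
def ppv_n_studies(R: float, n: int, alpha: float = 0.05, beta: float = 0.2) -> float:
    """PPV when a claim counts as a finding if any of n independent studies is significant."""
    true_hit = R * (1 - beta ** n)        # real relationship, at least one study detects it
    false_hit = 1 - (1 - alpha) ** n      # no relationship, at least one false positive
    return true_hit / (true_hit + false_hit)


# Even pre-study odds (R = 1), 80% power, alpha = 0.05
for n in (1, 2, 5, 10):
    print(n, round(ppv_n_studies(R=1.0, n=n), 3))
```

With 80% power the PPV falls as n grows, since extra studies add few new true detections but keep adding chances for a false positive. If the power is set below α (say β = 0.97), the direction flips, which is the exception named in the quoted sentence.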
A practical example is shown in Box 1. Based on the above considerations, one may deduce several interesting corollaries about the probability that a research finding is indeed true.
Ioannidis is able to give us some rules of thumb based on the mathematical models he’s presented so far.
Note that while some of his empirical assumptions are separate from his mathematical models, one of them (that bias and the existence of actual relationships are not linked) is baked into his mathematical model. Insofar as such a link does exist, and insofar as his corollaries depend on that assumption, his conclusions here will be suspect.
Box 1. An Example: Science at Low Pre-Study Odds
Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia.
This means that we’re looking for any possible genetic links with schizophrenia.
Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them.
So in this field, we actually do have some information on some of our “mystery numbers.” If the odds ratio is 1, that means that there’s no relationship between a gene polymorphism and schizophrenia. If it’s greater or less than 1, there is a relationship. An odds ratio of 1.3 means that the odds of schizophrenia are 1.3 times as high for people who carry a particular polymorphism as for people who don’t.
Then R = 10⁄100,000 = 10^−4, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10^−4.
So if we tried to pick one of our candidate gene polymorphisms at random, there would be a 0.01% chance that it’s linked to schizophrenia.
Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05.
We’re imagining that 40% of the time, we’ll fail to find a true effect because our methods aren’t powerful enough.
Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10^−4.
Every single gene we look at comes with a chance of turning up a false positive. That’s roughly 100,000 chances to generate a false positive finding. And so even if our test is pretty specific, it still might turn up quite a few false positives.
By contrast, only 10 of the genes have a chance to generate a true positive finding. Since our study isn’t that powerful, we might miss a fair number of those true positives.
Taken together, using these numbers, we’re almost certainly going to get some positive findings—and they’re almost guaranteed to be false positives, even though there’s still much better-than-chance (but still tiny!) odds that they’re real.
Let’s say I set my alarm for a completely random time and hid it. You are trying to guess when it will go off. There are 86,400 seconds in a day, so by guessing at random, you’ve got a 1⁄86,400 chance of getting it right.
Now let’s say that you can see my fingers moving on the dials of the alarm clock as I set it, but can’t actually see what I press. You have to use my finger motions to guess what buttons I might have pressed. Even if this information can rule out enough possibilities to improve your guess by 10x, you still only have a 1⁄8,640 chance of guessing correctly.
Similarly, our hypothetical gene association test has improved our chances of guessing the real genes linked with schizophrenia, but the odds that our candidates are in fact correct are still very low.
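To make the arithmetic concrete, here is a minimal Python sketch of the no-bias calculation. It assumes the paper's formula PPV = (1 − β)R / (R − βR + α); the function name `ppv` and the variable names are my own, picked for illustration.

```python
def ppv(R, power, alpha):
    """Post-study probability that a significant finding is true (no bias).

    R is the pre-study odds of a true relationship, power is 1 - beta,
    and alpha is the significance threshold.
    """
    beta = 1.0 - power
    return (1.0 - beta) * R / (R - beta * R + alpha)


R = 10 / 100_000            # ten true polymorphisms among 100,000 candidates
pre_study = R / (R + 1)     # pre-study probability of a true association
post_study = ppv(R, power=0.60, alpha=0.05)

print(f"pre-study probability:  {pre_study:.2e}")
print(f"post-study probability: {post_study:.2e}")
print(f"fold increase:          {post_study / pre_study:.1f}")
```

With these inputs the fold increase comes out at roughly 12, which is just power divided by α (0.60 / 0.05) when R is tiny. A significant result multiplies the odds by about 12, and twelve times almost nothing is still almost nothing.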
Now let us suppose that the investigators manipulate their design, analyses, and reporting so as to make more relationships cross the p = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan.
This is where the bias sets in. They’d already identified a bunch of fake associations (and maybe a few real ones mixed in). Now they’re adding a bunch more fake associations, further diluting our chances of figuring out which links are real.
Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified, changes in the disease or control definitions, and various combinations of selective or distorted reporting of the results.
The point is that there are a lot of plausible-sounding analysis choices that are in fact nothing more than distortions used to invent a finding. For example, they’ve got a hard drive full of patient diagnoses. They can choose a threshold for whether or not a patient counts as “schizophrenic,” and select the threshold that gives them the most associations while still sounding like a reasonable definition of schizophrenia.
Commercially available “data mining” packages actually are proud of their ability to yield statistically significant results through data dredging.
If true, this suggests that through ignorance or incentivized self-justification, there are enough researchers willing to do these kinds of shenanigans to make a market for it. And probably there are convincing-sounding salesmen able to reassure the researchers that what they’re doing is fine, normal, and even mission-critical to doing good science. “You wouldn’t want to miss a link between genes and schizophrenia, would you? People might die because of your negligence if you don’t use our software!!!”
In the presence of bias with u = 0.10, the post-study probability that a research finding is true is only 4.4 × 10^−4.
So if 10% of the non-relationships get reported as significant due to bias and there’s no link between bias and existence of a relationship, then only 0.044% of the supposed “links” are actually real.
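The bias figure can be checked the same way. This sketch assumes the paper's with-bias formula, PPV = ([1 − β]R + uβR) / (R + α − βR + u − uα + uβR); the function name `ppv_with_bias` and the extra values of u in the loop are my own additions for comparison.

```python
def ppv_with_bias(R, power, alpha, u):
    """Post-study probability that a finding is true when a fraction u of
    analyses that should have been negative get reported as positive."""
    beta = 1.0 - power
    numerator = (1.0 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


R = 10 / 100_000  # same pre-study odds as the schizophrenia example

# u = 0.10 is the value from the quoted passage; 0.0 and 0.30 are
# hypothetical values included only to show the trend.
for u in (0.0, 0.10, 0.30):
    print(f"u = {u:.2f}: PPV = {ppv_with_bias(R, 0.60, 0.05, u):.2e}")
```

Setting u = 0 recovers the no-bias case, and u = 0.10 lands near the quoted 4.4 × 10^−4. The bias adds far more false positives to the denominator than true positives to the numerator, because there are so many more false hypotheses to be biased about.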
Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only 1.5 × 10^−4, hardly any higher than the probability we had before any of this extensive research was undertaken!
And the more times we re-run this same experiment, the less able we’ll be to pick out the true links from amongst the false ones. Intuitively, this seems strange. Couldn’t we just look at which genes have the most overlap between the ten studies—in other words, do a meta-analysis?
I believe the whole issue here is that Ioannidis is presuming that we’re not doing a meta-analysis or in any other way comparing the results between these studies.
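Here is a rough sketch of how the "at least one team out of n finds significance" scenario behaves. It assumes the multiple-studies formula as I understand it from the paper, PPV = R(1 − β^n) / (R + 1 − [1 − α]^n − Rβ^n), with every team having the same power and no bias. The function name `ppv_n_teams` is mine, and I can't be sure my inputs match the ones behind the quoted figure, so the trend matters more than any single value here.

```python
def ppv_n_teams(R, power, alpha, n):
    """PPV when n independent, equally powered teams test the same
    relationship and a 'finding' means at least one got significance."""
    beta = 1.0 - power
    numerator = R * (1.0 - beta ** n)
    denominator = R + 1.0 - (1.0 - alpha) ** n - R * beta ** n
    return numerator / denominator


R = 10 / 100_000  # pre-study odds from the schizophrenia example

for n in (1, 2, 5, 10, 20):
    print(f"{n:2d} team(s): PPV = {ppv_n_teams(R, 0.60, 0.05, n):.2e}")
```

With n = 1 this reduces to the single-study formula, and the PPV drifts back down toward the pre-study probability as n grows. That only holds because each "positive" is read in isolation: the formula asks whether anyone got a hit, not how many teams agreed.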
Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.
That makes sense. Fewer subjects means that random chance can have a bigger effect, tending to create significance where there is none.
Small sample size means smaller power and, for all functions above, the PPV for a true research finding decreases as power decreases towards 1 − β = 0.05.
This is just based on the equations, not any empirical assumptions.
Thus, other factors being equal, research findings are more likely true in scientific fields that undertake large studies, such as randomized controlled trials in cardiology (several thousand subjects randomized) than in scientific fields with small studies, such as most research of molecular predictors (sample sizes 100-fold smaller).
This doesn’t necessarily mean that cardiology is better science than molecular predictors, because there are other factors involved. It just means that molecular predictors could improve the reliability of their findings by increasing the sample size.
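The claim that PPV falls as power falls toward 0.05 can be read straight off the no-bias formula. In this sketch the pre-study odds R = 0.25 and the list of power values are hypothetical numbers I picked for illustration; the formula is the paper's PPV = (1 − β)R / (R − βR + α), rewritten in terms of power.

```python
def ppv(R, power, alpha=0.05):
    """No-bias PPV; (1 - beta) * R / (R - beta * R + alpha) with power = 1 - beta."""
    return power * R / (power * R + alpha)


R = 0.25  # hypothetical pre-study odds: one true relationship per four false ones

for power in (0.80, 0.50, 0.20, 0.05):
    print(f"power = {power:.2f}: PPV = {ppv(R, power):.3f}")

print(f"pre-study probability R / (R + 1) = {R / (R + 1):.3f}")
```

When power equals α, a true relationship is no more likely to reach significance than a false one, so the PPV collapses to the pre-study probability R / (R + 1). A significant result from a study like that tells you nothing you didn't already believe.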
Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
If there are links between hair color and various aspects of personality, but they’re very small, then any given “discovery” (say a relationship between hair color and Machiavellianism) has a greater chance of being random noise in the data rather than a real effect.
Power is also related to the effect size.
So the power of a study isn’t equivalent to, say, the magnification strength of a microscope. It’s more like the ability of the microscope to see the thing you’re trying to look at. That’s a function not only of magnification strength, but also the size of the thing under observation. Are you looking for an insect, a tardigrade, a eukaryotic cell, a bacterium, or a virus?
Thus research findings are more likely true in scientific fields with large effects, such as the impact of smoking on cancer or cardiovascular disease (relative risks 3–20), than in scientific fields where postulated effects are small, such as genetic risk factors for multigenetic diseases (relative risks 1.1–1.5).
It’s easier to detect obvious effects than subtle effects. That’s important to bear in mind, since our bodies are very complex machines, and it’s often very hard to see how all the little components add up to a big effect that we care about from a practical standpoint, such as the chance of getting a disease.
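To see how power depends on both effect size and sample size, here is a sketch using a textbook normal approximation for a two-sided, two-sample comparison of means. This is not the relative-risk setting Ioannidis discusses: the standardized effect size `d`, the sample sizes, and the function name `approx_power` are all assumptions of mine, and the approximation ignores the negligible chance of significance in the wrong direction.

```python
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power to detect a standardized mean difference d
    with n_per_group subjects in each of two groups (two-sided test)."""
    z = NormalDist()                              # standard normal
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)         # about 1.96 for alpha = 0.05
    noncentrality = d * (n_per_group / 2.0) ** 0.5
    return z.cdf(noncentrality - z_crit)


for d in (0.8, 0.5, 0.2, 0.1):
    row = ", ".join(
        f"n={n}: {approx_power(d, n):.2f}" for n in (64, 640, 6400)
    )
    print(f"d = {d:.1f} -> {row}")
```

Because the sample size enters under a square root, halving the effect size means roughly quadrupling the sample to keep the same power. In microscope terms, the smaller the thing you're looking for, the more magnification you have to pay for.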
Modern epidemiology is increasingly obliged to target smaller effect sizes.
I.e., it’s increasingly looking for subtle, difficult-to-detect effects.
Consequently, the proportion of true research findings is expected to decrease.
Ceteris paribus, John, ceteris paribus. They could gather more data, use more reliable methods, or get better at predicting in advance which conjectures are true in order to compensate.
In the same line of thinking, if the true effect sizes are very small in a scientific field, this field is likely to be plagued by almost ubiquitous false positive claims.
ALL ELSE BEING EQUAL. Is it reasonable to assume that researchers step up their measurement game in proportion to the subtlety of the effects they’re looking for? Or should we assume that researchers trying to study genetic risk factors are using the same, say, sample sizes, as are used to determine whether there’s a link between seat belt usage and car crash mortality?
For example, if the majority of true genetic or nutritional determinants of complex diseases confer relative risks less than 1.05, genetic or nutritional epidemiology would be largely utopian endeavors.
What I really wish Ioannidis had done here is shown how that 1.05 number interacts with his equations to make it intractable to produce true findings. This paragraph is the first time he used the term “effect size” in the whole paper, so it’s not easy to know if 1.05 is supposed to be R, or some other number.
It’s conceivable to me, as a non-statistician, that small effect sizes could make it exponentially more difficult to find a real effect.
But couldn’t these fields increase their sample sizes and measurement techniques to compensate for the subtlety of the effects they’re looking for? Couldn’t the genome-wide association study be repeated on just the relationships discovered the first time around, or a meta-analysis be performed, in order to separate the wheat from the chaff? I understand there’s a file-drawer problem involved here, but Ioannidis has already semi-written-off two fields as “utopian endeavors” before he’s even addressed this obvious rebuttal.
So as a non-expert, I have to basically decide whether I think he’s leaving out these details because they’re actually not very important, or whether he himself is biasing his own analysis to make it seem like a bigger deal than it really is.
Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true.
So “shotgun research” is going to get more false positives than “rifle research.”
As shown above, the post-study probability that a finding is true (PPV) depends a lot on the pre-study odds (R).
Thus, research findings are more likely true in confirmatory designs, such as large phase III randomized controlled trials, or meta-analyses thereof, than in hypothesis-generating experiments.
That makes perfect sense, and addresses my objection above: if our first study reduces 100,000 candidate genes to 10,000, and our second reduces that to 100, our third study might reliably identify just 2 genes which we can feel 90% sure are candidate genes linked with schizophrenia. We can also get there via meta-analyses or just getting bigger sample sizes.
Fields considered highly informative and creative given the wealth of the assembled and tested information, such as microarrays and other high-throughput discovery-oriented research, should have extremely low PPV.
One constructive way of looking at this corollary is that we can see research as a process of refinement, like extracting valuable minerals from ore. We start with enormous numbers of possible relationships and hack off a lot of the rock, even though we also lose some of the gold. We do repeat testing, meta-analysis, and speculate about mechanisms, culling the false positives, until finally at the end we’re left with a few high-carat relationships.
To thoughtfully interpret a study, we need to have a sense for what function it serves in the pipeline. If it’s a genome-wide association study, we shouldn’t presume that we can pick out the real genetic links from all those candidates. And of course, if there’s a lot of bias in our research, then even a lot of refinement might not be enough to get any gold out of the rocks.
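The refinement picture can be sketched with expected counts. This assumes each stage is an independent, unbiased study with 60% power and α = 0.05, and that only candidates significant at one stage go on to the next. Those are my simplifying assumptions, not a description of how any real pipeline works, and the function name `refine` is made up.

```python
def refine(n_true, n_false, power, alpha, stages):
    """Expected number of true and false candidates surviving each
    successive round of significance testing."""
    history = []
    for stage in range(1, stages + 1):
        n_true *= power      # some real links are lost at every stage
        n_false *= alpha     # most false links are culled at every stage
        share_true = n_true / (n_true + n_false)
        history.append((stage, n_true, n_false, share_true))
    return history


# 10 real links hidden among 100,000 candidates, as in the quoted example.
history = refine(n_true=10, n_false=99_990, power=0.60, alpha=0.05, stages=5)

for stage, n_true, n_false, share_true in history:
    print(
        f"stage {stage}: {n_true:5.2f} true, {n_false:8.2f} false, "
        f"share true = {share_true:.3f}"
    )
```

Under these particular numbers it takes five rounds before the surviving candidates are more than 90% real, and by then we expect to have kept fewer than one of the ten true links. That is the ore metaphor in miniature: the gold gets purer, but a lot of it ends up on the floor unless later stages are better powered.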
Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.
This is a great reason to do pre-registration and automate data collection. Rather than thinking about whether we buy the researchers’ definitions and designs, we just demand that they decide what they’re going to do in advance. Tie yourself to the mast, Odysseus!
Flexibility increases the potential for transforming what would be “negative” results into “positive” results, i.e., bias, u.
Of course, the degree to which flexibility creates genuine problems will depend on the field, the researchers in question, etc. But pre-registration seems relatively easy to do and like it would be really beneficial. So why wouldn’t you Just Do It?
For several research designs, e.g., randomized controlled trials or meta-analyses, there have been efforts to standardize their conduct and reporting.
That’s a hopeful sign. But I know there’s a Garbage In, Garbage Out problem as well. If the studies they’re based on aren’t pre-registered, won’t we still have problems?
Adherence to common standards is likely to increase the proportion of true findings.
Or else it’ll give undeserved credibility to meta-analyses that are covering up a serious file drawer problem. Seems to me like this is a bottom-up problem more than a top-down problem, unfortunately.
The same applies to outcomes. True findings may be more common when outcomes are unequivocal and universally agreed (e.g., death) rather than when multifarious outcomes are devised (e.g., scales for schizophrenia outcomes).
It’s going to be a balance sometimes, right? We might face a choice between an outcome measure that’s unequivocal and an outcome that precisely targets what we’re trying to study. Ideally, we just want to do lots of studies, looking at lots of related outcomes, and not rely too much on any one study or measure. This is old accepted wisdom.
Similarly, fields that use commonly agreed, stereotyped analytical methods (e.g., Kaplan-Meier plots and the log-rank test) may yield a larger proportion of true findings than fields where analytical methods are still under experimentation (e.g., artificial intelligence methods) and only “best” results are reported.
So just as, within a field, there might be a pipeline by which the ore of ideas gets refined into the gold of real effects, there might be whole fields that are still in early days, just starting to figure out how to even begin trying to measure the phenomenon of interest. Being savvy about that aspect of the field would also be important to interpreting the science taking place within it.
Of course, once again, the absolute age of the field will only be one aspect of how we guess the reliability of its methods. Artificial intelligence research may still be in relatively early days, but since it takes place on computers, it might be easier to gather lots of data or standardize the tests with precision.
Regardless, even in the most stringent research designs, bias seems to be a major problem. For example, there is strong evidence that selective outcome reporting, with manipulation of the outcomes and analyses reported, is a common problem even for randomized trials. Simply abolishing selective publication would not make this problem go away.
So even if journals published all null findings, researchers might not submit all their null findings to the journal in the first place. They might also manipulate their data to obtain statistical significance, just because that’s a splashier outcome. It sounds like pre-registration would help with this problem. Both researchers and journals need to police their own behavior to correct this problem.
Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
I don’t see how this corollary follows from Ioannidis’s equations. It seems to flow from his empirical assumptions about how the world works. It might be reasonable, but it’s an empirical question, unlike corollaries 1-4.
Conflicts of interest and prejudice may increase bias, u.
Plausible, but the corollary as stated is about the existence of “financial and other interests,” not “conflicts of interest.” For example, I could run a nonprofit lab studying whether red meat is associated with heart disease. There’s tremendous money in the meat industry and in the pharmaceutical industry. But would you say that my nonprofit lab is running a greater risk of conflict of interest than a for-profit company doing a trial on its own spina bifida drug, just because there’s more money in the red meat and heart-disease-treatment industry than spina bifida treatments?
I also think we need to be careful with how we think about “prejudice.” If Ioannidis means that the researchers themselves are prejudiced about what their findings will be, it does seem plausible that they’ll find ways to distort their study to get the results they want. On the other hand, if a certain field is politically controversial, we can imagine many possibilities. Maybe there are two positions, and both are equally prejudice-laden. Maybe one of them is prejudice-laden, and the other is supported by the facts. Maybe the boundaries between one field and another are difficult to determine.
Without the kinds of clear-cut, pre-registered, well-vetted methods to determine what conflicts of interest and prejudice exist in a given field, and where the boundaries of that field lie, how are we to actually use these corollaries to evaluate the quality of evidence a field is producing? In the end, we’re right back to where we started: if you think a finding is bullshit because the researchers are prejudiced or motivated by financial interests, the onus is on you to come up with disconfirming evidence.
Conflicts of interest are very common in biomedical research, and typically they are inadequately and sparsely reported.
That does strike me as a real effect, but I’m not sure about the magnitude of this effect size. So for any given claim of biased research, Ioannidis would counsel me to bear in mind that the smaller the effect of conflicts of interest, the fewer actually true claims of biased false positives we will find.
Prejudice may not necessarily have financial roots. Scientists in a given field may be prejudiced purely because of their belief in a scientific theory or commitment to their own findings.
I’m sure there is a way to study this. For example, imagine there’s an overwhelmingly disconfirming study that gets published against a certain relationship. How does that study affect the rate of new studies carried out on that relationship?
Many otherwise seemingly independent, university-based studies may be conducted for no other reason than to give physicians and researchers qualifications for promotion or tenure. Such nonfinancial conflicts may also lead to distorted reported results and interpretations.
Many in relative or in absolute terms? May be conducted? This feels like a weasel-worded sentence to give us the impression of cynicism where none may be warranted. Again, I’m not a scientist (yet), so you can draw your own conclusions.
Prestigious investigators may suppress via the peer review process the appearance and dissemination of findings that refute their findings, thus condemning their field to perpetuate false dogma.
Just as you can list the many ways in which the flow of knowledge can be blocked, you can also think of the many ways knowledge can flow around barriers. Imagine a field is stymied by a few prestigious investigators protecting their pet theory by blocking competition through the peer review process. To what extent do you think it’s plausible that their dominion would block these contradictory findings, and for how long?
Can the scientists who discovered the contradictory findings get them published in a less prestigious journal? Can they publish some other way? Can they have behind-the-scenes discussions at conferences? Can they conspire to produce overwhelming evidence to the contrary? Can they mobilize to push out these prestigious propaganda-pushers?
Why, in this story, are the prestigious investigators so formidable, and the researchers they’re repressing so milquetoast, so weak? Why should we have any reason to think that? Here I am, writing a sentence-by-sentence breakdown of a famous paper with over 8,000 citations by a legendary statistician. Nothing’s stopping me from posting it on the internet. If people think it’s worth reading, they will.
Empirical evidence on expert opinion shows that it is extremely unreliable.
In light of that, how should we interpret the several uncited empirical opinions you’ve offered earlier in this paper, Prof. Ioannidis?
Also, what precisely does this have to do with prejudice and conflicts of interest? I thought this was about prejudice and conflicts of interest leading to distorted studies, not opinions getting proffered in the absence of a study?
Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.
This is the one corollary that doesn’t even begin to make immediate, intuitive sense to me.
This seemingly paradoxical corollary follows because, as stated above, the PPV of isolated findings decreases when many teams of investigators are involved in the same field.
So if we assume that fields A and B are identical in every respect except the number of teams involved, then the one with fewer teams—i.e. better coordination, more cohesiveness in how data is collected and interpreted—will find a greater proportion of true findings. That makes sense. But I don’t think you can use the sheer number of teams working in a field as a referendum on how likely its findings are to be true, relative to other fields. You can only say that adding more teams to the same field will tend to lead to worse coordination, more repeat studies with less meta-analysis.
That makes perfect sense, but a way to phrase that constructively would be to say that we should try to improve coordination between teams working on the same problem. This way makes it sound like we should be automatically suspicious of hot fields, and I don’t see a reason for that a priori. Maybe hot fields just have a lot to chew on. Maybe they attract better researchers and enough funding to compensate for the coordination difficulties. Who knows?
This may explain why we occasionally see major excitement followed rapidly by severe disappointments in fields that draw wide attention.
Of course, it’s reasonable to assume that sometimes Ioannidis is exactly right. But this statement here is a form of cherry-picking. Just because poor research coordination and bias and all sorts of other problematic practices can and sometimes even do lead to swings between excitement and utter disappointment, doesn’t mean that it happens all the time. It doesn’t mean that the sheer number of teams working in a scientific field is very well correlated with the chance of such an event.
With many teams working on the same field and with massive experimental data being produced, timing is of the essence in beating competition.
On the other hand, just like in business, scientists decide what field to enter, differentiate themselves from the competition, and specialize or coordinate to focus on different aspects of the same issue, in order to avoid exactly this problem.
Thus, each team may prioritize on pursuing and disseminating its most impressive “positive” results.
Here’s what I don’t get. We’re imagining that there are a lot of false hypotheses, and only a few true relationships. Any given false positive is drawn from the much larger pool of false hypotheses. So why would these teams be feeling a sense of urgency that their competition will beat them to the punch, if they’re all racing to publish false positives?
We’d only expect urgency if two competing teams are racing to publish a real relationship. There’s plenty of bullshit to go around! From that point of view, the more of a race-to-publish dynamic we see, the more we should expect that the finding is in fact true.
On the other hand, if we see lots and lots of impressive positive results, all different, and we’re looking at a new field, with hazy measurements, then we should be getting suspicious. That sounds a lot like social psychology.
A metaphor here might be the importance of aseptic technique in tissue culturing. We want to use lots of safeguards to prevent microbial contamination. If one fails, the others can often safeguard the culture. But if all of our safeguards fail at once, then we should be really worried about the risk of contamination spreading throughout the lab.
“Negative” results may become attractive for dissemination only if some other team has found a “positive” association on the same question.
So why don’t researchers publish their 19 file-drawer null-result studies after the first false positive association on the same question makes it to press? Does that mean that prejudice is even worse than we fear? Are scientists afraid to criticize each other to the point that they’ll decide to forgo publication? Even if the data’s old, if you’re looking at somebody who found a positive association when you know you’ve got contradictory data locked up in your file drawer, why wouldn’t you say “Great! I’ll just do a repeat of that study—I already have some practice at it, after all.”
In that case, it may be attractive to refute a claim made in some prestigious journal.
It may, it may, it may. That word, “may,” turns up a lot in this paper.
Originally, the idea was that journals wouldn’t publish null results. Now it’s that they will, but scientists didn’t find it “attractive” to publish until the false-positive became a prestigious target. So what’s the implication? That scientists are doing studies on some obscure phenomenon, getting a null result, then locking it in a file drawer until some other team’s false positive makes it into Nature, and only then publishing their contradictory null result in order to contradict “an article that got published in Nature?”
Now that’s some 4-dimensional chess. Either scientists are the most Machiavellian crew around, and really picked the wrong profession, or else maybe Prof. Ioannidis is choosing his words to make a worst-case scenario sound like a common phenomenon.
The term Proteus phenomenon has been coined to describe this phenomenon of rapidly alternating extreme research claims and extremely opposite refutations [29].
Has been coined by whom? Let’s just check citation 29:
29. Ioannidis JP, Trikalinos TA (2005) Early extreme contradictory estimates may appear in published research: The Proteus phenomenon in molecular genetics research and randomized trials. J Clin Epidemiol 58: 543–549.
“The term Proteus phenomenon has been coined” sure sounds a lot more sciency than “I made up the term ‘Proteus phenomenon’.”
Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics.
I’m going to take Ioannidis’s empirical claim here as scientific truth. I’m extremely glad he’s making an effort to do empirical research on the phenomena he’s worried about, and I’m dead certain that he’s not just pulling all this out of his butt.
But again, if researchers are racing each other to publish, then you’d expect that they’ve converged on a single truth, not a single falsehood. I guess I’d need to read his empirical paper to decide if it made a compelling case that the “Proteus Phenomenon” in molecular genetics is due to having too many teams working on the same problem, or due to some other cause. Maybe they’re just misunderstanding the false positive rate of the techniques they’re using, getting their hopes up, and then having it all come crashing down.
These corollaries consider each factor separately, but these factors often influence each other.
This is what I’ve been complaining about throughout this section. Can’t improvements in one area compensate for shortcomings in another?
For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large.
So for Ioannidis, using a more powerful study to look for a smaller effect is… bad? So what, we should just not look for small effects?
Or prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings.
It may! It may! It may!
Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results.
Or they may create a barrier that stifles efforts… or restricts efforts… or hinders efforts… or slows efforts… or interferes with efforts… or complicates efforts… or really doesn’t do very much to efforts at all… or makes efforts look all the more necessary… or makes scientists work all the harder out of spite… or creates a political faction specifically to oppose their efforts in an equal and opposite reaction… or leads to lasting suspicion of industry-funded studies… or leads to lasting suspicion of studies that support industry claims even in the absence of conflicts of interest...
Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Or massive discovery-oriented testing may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.
Didn’t you just say that larger studies can make the predictive value worse?
“These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large.”
Why yes, in context you did.
I’m quite confused now.
Anyway, I’m glad to hear Prof. Ioannidis say that hot fields with strong invested interests could be good or bad for the field. Of course, the rhetorical structure of this paper has tended to dwell on the bad. I’m scratching my head, thinking that there’s almost a parallel issue… like… I don’t know, maybe a scientist who crams a bunch of contradictory findings into a file drawer until we’re sold enough on a false positive that publishing the null result makes for a splashy article?
Most Research Findings Are False for Most Research Designs and for Most Fields
As shown, the majority of modern biomedical research is operating in areas with very low pre- and post-study probability for true findings.
As shown? Where did you show this? As stated, PPV depends on R, u, and the chance of Type I and Type II errors, and it’s rare that we’ll have strong evidence about what their true values are. This claim here is uncited, and I truly don’t see how it follows from the prior argument.
And furthermore, “true findings” imagines that every claim of statistical significance is read on a grand, unified list, from all the papers in the field. If that very stupid approach to interpretation was actually how scientists interpreted papers, it would be a huge problem. And I’m sure it is sometimes, and even more often how motivated interests outside of science will behave, waving a paper around and claiming it supports their hokum about some naturopathic treatment or whatever.
But just the fact that a field of science produces claims that are statistically significant doesn’t mean that its practitioners are such blathering idiots that they think that every statistically significant finding is God’s own truth. Does Ioannidis think he’s the only one who’s realized that a GWAS is going to turn up a lot of false positives?
Let us suppose that in a research field there are no true findings at all to be discovered. History of science teaches us that scientific endeavor has often in the past wasted effort in fields with absolutely no yield of true scientific information, at least based on our current understanding.
How often? How much effort? How do you define “field?” Are we talking about gigantic expenditures of scientific effort with nothing to show for it? Or are we talking about a field that amounted to a few scientists, for a few years, throwing up results, getting some attention, getting refuted, and eventually shutting down, amounting in total to 0.0001% of total scientific effort? I don’t know, and I can’t know, because this is an uncited empirical claim, without even an example as a reference point. Sure I can think up examples on my own, but I really have no idea whether we’re talking about a perpetual disaster or a few dramatic blow-ups here and there.
In such a “null field,” one would ideally expect all observed effect sizes to vary by chance around the null in the absence of bias. The extent that observed findings deviate from what is expected by chance alone would be simply a pure measure of the prevailing bias.
So Ioannidis is imagining a group of scientists studying the relationship between a person’s hair color and the chance of their getting heads on a coin flip. Any relationship discovered by these scientists would be purely due to bias.
For example, let us suppose that no nutrients or dietary patterns are actually important determinants for the risk of developing a specific tumor.
Or, as a more obviously null example, the conjecture that people’s hair is related to their chance of getting heads on a fair coin flip.
Let us also suppose that the scientific literature has examined 60 nutrients and claims all of them to be related to the risk of developing this tumor with relative risks in the range of 1.2 to 1.4 for the comparison of the upper to lower intake tertiles.
Or that the scientific literature has examined 60 aspects of hair color—shade, texture, length, and so on—and found all of them to be related to the chance of getting heads to a small but meaningful extent.
Then the claimed effect sizes are simply measuring nothing else but the net bias that has been involved in the generation of this scientific literature.
Well, hair has nothing to do with the chance of a coin coming up heads, so all these claims are just measuring the researchers’ tendency to manipulate their data, data dredge, hide null results, and so on.
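Here’s a minimal sketch of that idea in Python. It is my own toy model, not anything from the paper. It assumes a deliberately crude form of bias: every study compares two groups drawn from the same distribution (so the true effect is exactly zero), and only results that clear z > 1.96 in the “expected” direction get published. The function name, sample sizes, and cutoff are all made-up choices for illustration.

```python
import random
import statistics


def simulate_null_field(n_studies=2000, n_per_group=50, z_cutoff=1.96, seed=0):
    """Simulate many studies of an effect that is truly zero.

    Both groups in every study come from the same normal distribution, so any
    observed difference is chance. "Publication" (an assumed, crude model of
    bias) keeps only studies whose z statistic exceeds the cutoff.
    """
    rng = random.Random(seed)
    # Standard error of a difference in means, treating the sd as known (= 1).
    se = (2 / n_per_group) ** 0.5
    all_effects, published = [], []
    for _ in range(n_studies):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = statistics.fmean(group_a) - statistics.fmean(group_b)
        all_effects.append(diff)
        if diff / se > z_cutoff:
            published.append(diff)
    return all_effects, published


all_effects, published = simulate_null_field()
mean_all = statistics.fmean(all_effects)
mean_published = statistics.fmean(published)
print(f"mean effect, all studies:    {mean_all:+.3f}")
print(f"mean effect, published only: {mean_published:+.3f}")
print(f"share of studies published:  {len(published) / len(all_effects):.1%}")
```

The average across all studies should sit near zero, while the published average has to sit above the cutoff by construction. In this toy world, the size of the “claimed effect” tells you about the filter, not about hair or coins.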
Claimed effect sizes are in fact the most accurate estimates of the net bias. It even follows that between “null fields,” the fields that claim stronger effects (often with accompanying claims of medical or public health importance) are simply those that have sustained the worst biases.
Earlier in this paper, Ioannidis said that stronger effect sizes improve the PPV. Now he’s saying that they might also just indicate the bias in the field itself.
But another of his premises was that bias and the existence of an actual relationship were most likely unrelated.
You can’t have it both ways, John. Should we be worried because we see a large effect size (indicating bias)? Or should we be worried because we see a small effect size (indicating that findings are more likely to be false positives)?
For fields with very low PPV, the few true relationships would not distort this overall picture much.
So if a field is just getting its feet on the ground figuring out what’s related to what, then we should be extra skeptical of any individual finding. Fair enough. That doesn’t mean the field will stay in that immature state forever, or that we should be permanently skeptical of its conclusions. It just means we need to have an appreciation that scientific maturity of a field takes time. It is equally an argument for being more patient with a field in the early days, while it figures out its methods and mechanisms.
Even if a few relationships are true, the shape of the distribution of the observed effects would still yield a clear measure of the biases involved in the field.
For a field in early days, too much bias can potentially kill its ability to figure out the real relationships. We can imagine a field struggling to find a reliable, significant result. The grant money starts to dry up. So the researchers who are most invested in it find ways to manipulate their studies so that a couple of relationships suddenly start getting confirmed, again and again. And now they’re deeply invested in protecting those spurious “relationships.” And if other researchers get deeply attached to the subject area, intrigued by the strength of these findings, then the cycle can continue.
So we have a quandary. How can we distinguish a field in which a few reliable, true findings have been refined out of the conjectural ore, from a field in which a Franken-finding is running around murdering the truth?
This concept totally reverses the way we view scientific results.
Yes, if we totally accept it. But should we?
Traditionally, investigators have viewed large and highly significant effects with excitement, as signs of important discoveries. Too large and too highly significant effects may actually be more likely to be signs of large bias in most fields of modern research.
Under this view, large and highly significant effects are still cause for excitement. It’s like the GWAS example earlier. A large, highly significant effect might still be due to nothing but bias. But imagine a study comes out with an effect size of 1.05. Now imagine that the study actually had an effect size of 1.50. Now imagine that the effect size was 2.0. Did the actual truth of the effect seem more likely or less likely as the effect size increased?
It probably depends on context. If a psychologist comes out tomorrow saying that blondes are extraordinarily more neurotic than other hair colors, I think I would have noticed, and would guess that they screwed up their study somehow. On the other hand, if a GWAS finds that a particular genetic polymorphism has an extraordinarily strong link to prostate cancer, I have no particular reason to think that bias is responsible. It’s the way that the strength of the effect size fits into my prior knowledge of the problem under study that informs my interpretation, not the sheer size of the effect alone. And I really can’t see why a super-low p-value would make me think that bias is more likely to be a culprit.
Ioannidis is not giving any reason why we’d think that modern research is particularly likely to be susceptible to bias.
I admit that he’s the PhD statistician, not I, but if I ignore his credentials and just look at the strength of his argument, I’m just not seeing it. He needs to try harder to convince me.
They should lead investigators to careful critical thinking about what might have gone wrong with their data, analyses, and results.
Which they should be doing anyway, after they sleep off their drunk from all the champagne-drinking.
Of course, investigators working in any field are likely to resist accepting that the whole field in which they have spent their careers is a “null field.”
Which would be a logical conclusion, if there is a low prior probability of any given decades-old field being a “null field.”
However, other lines of evidence, or advances in technology and experimentation, may lead eventually to the dismantling of a scientific field.
Once again, Ioannidis presents all these scary barriers to the truth, and once he’s got us sold on them, he presents us with the solution we already knew existed as his own idea. Truth flows around the barriers.
Obtaining measures of the net bias in one field may also be useful for obtaining insight into what might be the range of bias operating in other fields where similar analytical methods, technologies, and conflicts may be operating.
Yes, please, I would love to see some actual empirical data on this matter. Of course, we have to remember that the people conducting this research into bias may themselves be equally, if not more, susceptible to bias. After all, who will watch the watchers?
And since we’ve already established in corollary 4 that greater flexibility in the study design can lead to greater rates of false positives, I’m curious to know how bias researchers will find sufficiently rigid definitions and measures of the field under study and the way bias is measured. With that caveat, I sincerely say more power to anybody who does this research!
How Can We Improve the Situation?
Is it unavoidable that most research findings are false, or can we improve the situation?
Well, I don’t think you’ve come anywhere close to proving that most research findings are false in this paper, so no, I wouldn’t say it’s unavoidable. In fact, having heard about this paper as some critically important lens for interpreting biomedical research for years now, I’m absolutely shocked at the weakness of the empirical evidence underlying its actual content.
But whatever the PPV is, there are clearly some steps we can take to improve it: pre-registration, making it easier for scientists to publish null results, better measures and greater sophistication, etc.
A major problem is that it is impossible to know with 100% certainty what the truth is in any research question. In this regard, the pure “gold” standard is unattainable.
Except that we can definitely be sure that most published research findings are false, of course.
However, there are several approaches to improve the post-study probability.
Better powered evidence, e.g., large studies or low-bias meta-analyses, may help, as it comes closer to the unknown “gold” standard.
So one of those large studies you said earlier could also lead to more false positives? Oh, and a “low-bias” meta-analysis might reduce bias? Great idea, keep ’em coming!
However, large studies may still have biases and these should be acknowledged and avoided.
And what if they can’t be avoided? Then maybe we just need to do more studies… Except wait, that was bad too...
Moreover, large-scale evidence is impossible to obtain for all of the millions and trillions of research questions posed in current research.
There are something like 2.5 million scientific papers published every year. Trillions of research questions? A trillion is a million million. Is the average research paper posing 400,000 research questions? Or is he just tossing around these numbers as a rhetorical device?
This is a little hard to take from a guy advocating such fastidiousness with the numbers.
Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive.
So we want to be strategic with our resources. Rather than throwing a lot of money and brainpower at questions with a low a priori chance of being correct, we should focus our energies on reliably answering questions, where possible. Or alternatively, figuring out efficient ways to test many questions with a low a priori chance of being correct… like with those high-throughput studies he was talking about earlier...
Large-scale evidence is also particularly indicated when it can test major concepts rather than narrow, specific questions. A negative finding can then refute not only a specific proposed claim, but a whole field or considerable portion thereof.
So we should only do a 10,000-subject test on the association of hair color and personality when it stands a chance of definitively supporting or rejecting the existence of a broad link between hair color and personality in general, but not when it only stands to support or disprove some narrow link between a particular hair color and a particular personality trait. That seems like a good strategy. But then again, we also need big studies to detect or disprove small effects.
Actually, maybe we just need more funding for science so that we can adequately test all of these claims—both making room for newer fields to get established and for more mature fields to definitively answer the questions they’ve been wrestling with for years and decades.
We should also encourage young scientists to learn techniques that allow them to do the more powerful tests. Are we sufficiently exploiting the massive data being cranked out by the U.K. Biobank? How many biomedical grad students are we failing to teach big-data research, leaving them instead to waste their efforts producing small data sets in separate labs because their mentors only know how to do wet lab work?
Selecting the performance of large-scale studies based on narrow-minded criteria, such as the marketing promotion of a specific drug, is largely wasted research.
Or more generally, if you have no plausible reason to think a big expensive study will find a real result, then you’re just recruiting scientists to write a script for the most boring advertisement in the world.
Moreover, one should be cautious that extremely large studies may be more likely to find a formally statistical significant difference for a trivial effect that is not really meaningfully different from the null.
And to be quite clear, this means that a deeply unimportant, minuscule effect, so tiny we shouldn’t be worrying about it at all, is more likely to be found by an extremely large study. It doesn’t mean that the results of extremely large studies are more likely to be trivial, or that increasingly large study size means we should have less trust that the relationship it detects is real.
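A quick sketch of why sample size matters here, under assumptions that are mine and not the paper’s: a simple two-group z-test with a known standard deviation of 1, and an observed difference pinned at a hypothetical 0.01 standard deviations. The function name and the sample sizes are made up for illustration.

```python
import math


def two_sided_p(effect, n_per_group):
    """Two-sided p-value for an observed difference in means of `effect`
    standard deviations between two groups of size `n_per_group`
    (z-test, sd assumed known and equal to 1)."""
    z = effect * math.sqrt(n_per_group / 2)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


TRIVIAL_EFFECT = 0.01  # hypothetical: one hundredth of a standard deviation

for n in (1_000, 100_000, 10_000_000):
    print(f"n per group = {n:>10,}  p = {two_sided_p(TRIVIAL_EFFECT, n):.3g}")
```

The same tiny difference goes from nowhere near significant to overwhelmingly significant as n grows. That tells us big studies can detect trivia, not that what they detect is less likely to be real.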
Second, most research questions are addressed by many teams, and it is misleading to emphasize the statistically significant findings of any single team. What matters is the totality of the evidence.
Diminishing bias through enhanced research standards and curtailing of prejudices may also help. However, this may require a change in scientific mentality that might be difficult to achieve.
A modest proposal: what if we just feed anybody who’s had a study published in Nature to hungry grad students?
In some research designs, efforts may also be more successful with upfront registration of studies, e.g., randomized trials. Registration would pose a challenge for hypothesis-generating research.
I’ve worked as a music teacher for children for a decade. And this has the ring of truth to me. In my early years, I would constantly have ideas about children and their personalities, techniques that seemed to help get particular ideas across, ways to influence the kids’ behavior and maintain my energy levels. The vast majority of these ideas wouldn’t pan out, or would only work on one particular kid.
But a few ideas did stick, and occasionally they would redefine the way I taught. I had to really believe in each one, stick with it, be willing to give it a fair chance, tweak it, but set it aside if it didn’t work out.
The nice thing is that I got to directly observe the ideas in action, and because I like to see my students succeed, the incentive structure was really nice. It seems logical to me that we need some researchers to babble out new hypotheses with very little barrier, and then a series of ever-more-severe prunings until we arrive at the Cochrane database.
Some kind of registration or networking of data collections or investigators within fields may be more feasible than registration of each and every hypothesis-generating experiment. Regardless, even if we do not see a great deal of progress with registration of studies in other fields, the principles of developing and adhering to a protocol could be more widely borrowed from randomized controlled trials.
Just a thought. If other, less formal, less visible methods might work to reduce bias, is there maybe a chance that they’re already in place to some extent? This paper does a lot of priming us to believe that research is a lot of Wild West faith-healing quackery. What if instead scientists care about the truth and have found ways to circumvent the biases of their field in ways that John Ioannidis hasn’t noticed or seen fit to mention?
What if the studies that focus more on creative hypothesis-generation and less on rigorous falsification are just using methods appropriate to the wide end of the research funnel?
Finally, instead of chasing statistical significance, we should improve our understanding of the range of R values—the pre-study odds—where research efforts operate. Before running an experiment, investigators should consider what they believe the chances are that they are testing a true rather than a non-true relationship. Speculated high R values may sometimes then be ascertained.
In that case, we’d probably need to create grant-making structures that are tailored to hypothesis-generating or hypothesis-confirming/falsifying research. A study that is expected to generate 1,000 relationships, of which 2 will eventually be proven real and novel, might be as valuable as a study that takes a single pretty-solid hypothesis and convincingly determines that it’s true.
But the NIH probably wants to be clear on which type of evidence it’s buying. And until everybody’s on board with the idea that a 2/1000 hit rate can be worthwhile if the true findings can be teased out, and are novel and important enough, then early-stage hypothesis-generating research will probably keep on pretending that its findings are a lot more solid than they are, just to get the statistics cops like John Ioannidis off their backs.
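To put rough numbers on the two kinds of research, here is a sketch using the no-bias PPV formula from earlier in the paper, PPV = (1 − β)R / (R − βR + α), where R is the pre-study odds. The 80% power, the α of 0.05, and the example R values are my own illustrative assumptions, and the function names are hypothetical.

```python
def ppv(pre_study_odds, power=0.8, alpha=0.05):
    """Post-study probability that a significant finding is true, with no
    bias term: (1 - beta) * R / (R - beta * R + alpha), where power = 1 - beta."""
    true_positives = power * pre_study_odds
    return true_positives / (true_positives + alpha)


def odds_for_ppv(target_ppv, power=0.8, alpha=0.05):
    """Invert the formula: the pre-study odds R that would yield `target_ppv`."""
    return target_ppv * alpha / (power * (1 - target_ppv))


# Hypothetical settings: a pretty-solid hypothesis (even odds) versus
# wide-net exploration (one true relationship per thousand null ones).
for label, odds in (("confirmatory, R = 1", 1.0), ("exploratory, R = 1/1000", 0.001)):
    print(f"{label:<24} PPV = {ppv(odds):.3f}")

# The pre-study odds at which 2 in 1,000 significant findings would be real.
print(f"R giving a 2/1000 hit rate: {odds_for_ppv(0.002):.6f}")
```

Under these assumptions, the same significance threshold means very different things at the two ends of the funnel, which is why a funder would want to know which kind of evidence it’s buying.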
I mean, just a thought. I don’t really know whether this story I’m spinning really reflects grant-making dynamics in scientific research. But I feel like all of Ioannidis’s assertions and assumptions are provoking an equal and opposite reaction from me. And I think that’s the proper way to read a paper like this. Counter hypothesis with hypothesis, and let the empirical data be the ultimate arbiter.
As described above, whenever ethically acceptable, large studies with minimal bias should be performed on research findings that are considered relatively established, to see how often they are indeed confirmed. I suspect several established “classics” will fail the test.
A good idea, and I’d love to see the data. Of course, when Ioannidis says he suspects “several” classics will fail the test, that’s only interesting if we know the sheer number of studies he’s imagining will be carried out, and whether he’s defining a “classic” as a study considered overwhelmingly confirmed, or as a popular study that doesn’t actually have as much support as you’d expect, given the number of times it gets referenced.
Nevertheless, most new discoveries will continue to stem from hypothesis-generating research with low or very low pre-study odds. We should then acknowledge that statistical significance testing in the report of a single study gives only a partial picture, without knowing how much testing has been done outside the report and in the relevant field at large.
After all the drama, this seems like a perfectly reasonable conclusion. Yes, we shouldn’t take every relationship turned up in the first GWAS done for some disease as even likely to be correct. “More research is required,” as they say.
Despite a large statistical literature for multiple testing corrections, usually it is impossible to decipher how much data dredging by the reporting authors or other research teams has preceded a reported research finding.
If it’s impossible to decipher, then that means we should be suspicious of any study that claims to have some fancy statistical method to detect data dredging. But it does seem possible, to some extent.
For example, if an author publishes some very weirdly-specific data in support of what should be a very broad conclusion, I might start to get suspicious. To make it concrete, imagine that our hair color/personality researcher goes around talking a big talk at conferences about how obvious he believes this link to be. And yet every time he comes out with a study, it’s something like “blondes with short hair and bangs tend to have slightly more Machiavellian traits when we analyze the coffee-pot-filling behavior of office workers on Mondays using this one unusual set of statistical techniques.” Sounds like somebody’s been having a little too much fun with an unusually rich and unique data set.
Even if determining this were feasible, this would not inform us about the pre-study odds. Thus, it is unavoidable that one should make approximate assumptions on how many relationships are expected to be true among those probed across the relevant research fields and research designs.
This would be an appropriate time to remind the audience that the title of this paper was “Why Most Published Research Findings Are False.” Well, by certain approximations and assumptions.
The wider field may yield some guidance for estimating this probability for the isolated research project.
Well, now that we know based on our assumptions and approximations that most published research findings are false, I guess that gives us a basis for estimating the probability of isolated fields? Perhaps with a few more assumptions? And once we’ve determined those, then we might actually get around to making some assumptions about individual research questions?
Experiences from biases detected in other neighboring fields would also be useful to draw upon.
Yes, I have to assume they would be!
Even though these assumptions would be considerably subjective, they would still be very useful in interpreting research claims and putting them in context.
Look, there’s a very strong argument to be made here. It’s that if you’re trying to understand a research field, it’s helpful to see it as a funnel model. We start with limited understanding of the relationships at play. Gradually, we are able to elucidate them, but it takes time, so don’t just take P < 0.05 as gospel truth. Early on, it’s OK to do lots of cheap but not-too-decisive tests on tons of hypotheses. As the field matures, it needs to subject its most durable findings to increasingly decisive pressure tests. The relationships that survive will in turn inform how we interpret the plausibility of the more novel, less-supported findings being generated on the other end of the research pipeline.
This is a vision of an iterative design process, and it makes perfect sense.
My problem with this paper is that it projects a profoundly cynical view of this whole enterprise, and that it bases the title claim on little more than assumption piled on assumption. And it gets used as a tool by other cynics to browbeat people with even a modest appreciation for scientific research.
So let’s not do that anymore, OK?