Simple explanation of meta-analysis; below is a copy of my attempt to explain basic meta-analysis on the DNB ML. I thought I might reuse it elsewhere, and I’d like to know whether it really is a good explanation or needs fixing.
A useful concept is the hierarchy of evidence: we all know anecdotes are close to worthless, correlations or surveys are fairly weak, experiments are good, randomized experiments are better, controlled randomized experiments are much better, and blind controlled randomized experiments are best. If a randomized experiment contradicts an anecdote, we know to believe the experiment; and if a blind controlled randomized experiment contradicts a plain experiment, we know to believe the blind controlled randomized experiment. But what happens when we have a bunch of studies on the same level… which don’t agree? What do we do if only 3 out of 5 experiments report the same result? We need to somehow combine the 5 experiments into 1 final result. The process of combining them is a “meta-analysis”.
What parts of the experiments get combined may surprise you if you’ve read a few papers. Meta-analyses usually presume you know what an ‘effect size’ is. This is different from stuff like p-values, even though p-values are what everyone usually focuses on when judging results! The difference is that p-values say whether there (probably) is a difference between the control and experimental groups at all, while effect sizes say how big that difference is. It turns out that you can’t really combine p-values from different studies in any useful way, but you can combine effect sizes.
Each study gives you an effect size, based on the averages and standard deviation (how variable or jumpy the data is). What do you do with 10 effect sizes? How do you combine or add or aggregate them? That’s where meta-analysis comes in.
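To make that concrete, here is a minimal Python sketch of what one study’s numbers turn into. The scores are entirely made up and the formulas are just the textbook t-test and Cohen’s d; none of this comes from the original post or any real study:

```python
import numpy as np
from scipy import stats

# Hypothetical scores for one small study (invented numbers).
treatment = np.array([105, 110, 98, 112, 107, 101, 115, 109])
control   = np.array([100, 103, 97,  99, 104,  96, 102, 101])

# p-value: "is there a difference at all?"
t_stat, p_value = stats.ttest_ind(treatment, control)

# Effect size (Cohen's d): "how big is the difference, in standard-deviation units?"
mean_diff = treatment.mean() - control.mean()
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d  = mean_diff / pooled_sd

print(f"p-value = {p_value:.3f}, Cohen's d = {cohens_d:.2f}")
```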
You could just treat each as a vote: if 6 of the effect sizes are positive, and 4 are negative, then declare victory: “There’s an effect of X size.” (Some of the first meta-analyses, like the famous one combining studies of psychic effects, did just this.)
But what if some of the effects are huge, like 0.9, and all the others are 0.1? If we just vote, we get 0.1 since that’s the majority. But is 0.1 really the right answer here? Doesn’t seem like it.
So instead of voting, let’s average! We add up the 10 studies and get something like +5; then divide by 10 and get 0.5 as our estimate. Much more reasonable: the 0.9s seem too high (they may be outliers), but 0.1 seems too low since we did get some 0.9s; the average splits the difference.
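As a toy illustration (the effect sizes here are invented, not from any real meta-analysis):

```python
# Averaging instead of letting the majority value win.
effect_sizes = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
naive_mean = sum(effect_sizes) / len(effect_sizes)
print(round(naive_mean, 2))  # ~0.42: in between, rather than the majority's 0.1
```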
But studies don’t always have the same number of subjects, and as we all know, the more subjects or data you have, the better an estimate you have of the true value. A study with 10 students in it is worth much less than a study which used 10,000 students! A simple average ignores this truth.
So let’s weight each effect size by how many subjects/datapoints went into it: the effect size from the study with 10 students counts for much less* than the one from 10,000 students. So now if the first 9 studies have ~10 datapoints each, and the 10th study has 1,000 datapoints, those 9 together count as, say, 1/10th* of the last study, since they totaled ~100 to its 1,000.
So each effect size gets weighted by how many datapoints went into making it, and then they’re averaged together as before to give One Effect Size To Rule Them All.
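Here is a rough sketch of that weighting, with invented effect sizes and sample sizes. (Real meta-analyses usually weight by the inverse of each study’s variance rather than raw sample size, but the intuition is the same.)

```python
# Hypothetical studies: 9 small ones with big effects, 1 large one with a small effect.
effect_sizes = [0.9, 0.8, 0.9, 0.7, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1]
sample_sizes = [ 10,  12,  15,  10,  10,  10,  12,  10,  15, 1000]

unweighted = sum(effect_sizes) / len(effect_sizes)
weighted   = sum(d * n for d, n in zip(effect_sizes, sample_sizes)) / sum(sample_sizes)

print(f"unweighted mean = {unweighted:.2f}, weighted mean = {weighted:.2f}")
# The 1,000-subject study dominates, pulling the pooled estimate toward its 0.1.
```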
With this done, we can start looking at other questions like the following (a rough code sketch of these checks comes after the list):
confidence intervals (this One Effect Size is not exactly right, of course, but how far away is it from the true effect size?)
heterogeneity (are we comparing apples to apples, or did we include some oranges?)
or biases (funnel plots and trim-and-fill: does it look like some studies are missing?)
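Here is that sketch, using the standard fixed-effect formulas (inverse-variance pooling, a normal-approximation confidence interval, and Cochran’s Q and I² as a crude heterogeneity check). The effect sizes and variances are invented, and this is only the simplest version of the machinery, not the procedure used in any particular meta-analysis:

```python
import numpy as np
from scipy import stats

d = np.array([0.9, 0.7, 0.2, 0.1, 0.15])     # hypothetical effect sizes
v = np.array([0.08, 0.10, 0.05, 0.02, 0.03])  # hypothetical variances of those effects

w = 1 / v                                     # inverse-variance weights
pooled = np.sum(w * d) / np.sum(w)            # One Effect Size
se = np.sqrt(1 / np.sum(w))
ci_low, ci_high = pooled - 1.96 * se, pooled + 1.96 * se   # 95% confidence interval

Q = np.sum(w * (d - pooled) ** 2)             # Cochran's Q: spread beyond chance?
df = len(d) - 1
I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0
p_het = stats.chi2.sf(Q, df)

print(f"pooled d = {pooled:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"Q = {Q:.1f} (p = {p_het:.3f}), I^2 = {I2:.0f}%")
```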
These other factors help us in the unlikely case that we have multiple meta-analyses at odds:
which meta-analysis is made up of studies higher on the hierarchy? A meta-analysis of experiments beats a meta-analysis of surveys, just like experiments beat surveys.
which has more studies in it?
which has smaller confidence intervals?
which has lower heterogeneity?
which looks better on the bias checks? etc.
An example of the further questions we can ask:
In the case of the DNB meta-analysis, we can look at the One Effect Size over all studies, which was something like 0.5. But some studies are high and some are low; is there any way to predict which are high and low? Is there some characteristic that might cause the effect sizes to be high or low? I suspected that there was: the methodological critique about active versus passive control groups. (I actually suspected this before the Melby meta-analysis came out, which did the same thing over a larger selection of WM-related studies.)
So I separate out the effect sizes from studies with active control groups and the ones with passive control groups, and run 2 smaller meta-analyses, one on each category. Did the 2 smaller meta-analyses spit out roughly the same answer as the full meta-analysis? No, they did not! They spat out quite different answers: studies with passive control groups found that the effect size was large, and studies with active control groups found that the effect size was small. This serves as very good evidence that yes, the critique is right, since it’s not very likely that a random split of the studies would separate them so neatly.
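A sketch of what that subgroup split looks like in code. The studies, effect sizes, and variances below are invented, and the pooling is plain fixed-effect inverse-variance weighting rather than whatever the actual DNB meta-analysis used:

```python
import numpy as np

def pooled_effect(d, v):
    """Fixed-effect (inverse-variance) pooled effect size."""
    w = 1 / np.asarray(v)
    return float(np.sum(w * np.asarray(d)) / np.sum(w))

studies = [
    # (effect size, variance, control-group type) -- made-up values
    (0.80, 0.09, "passive"),
    (0.70, 0.10, "passive"),
    (0.60, 0.08, "passive"),
    (0.20, 0.05, "active"),
    (0.10, 0.04, "active"),
    (0.15, 0.06, "active"),
]

for group in ("passive", "active"):
    d = [es  for es, _,   g in studies if g == group]
    v = [var for _,  var, g in studies if g == group]
    print(group, round(pooled_effect(d, v), 2))
# If the two subgroup estimates differ sharply, the control-group type looks
# like a real moderator rather than noise.
```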
And that’s the meat of my meta-analysis. I hope this was helpful?

* How much less? Well, that’s where statistics comes in. It’s not a simple linear sort of thing: 100 subjects is not 10x better than 10 subjects, but less than 10x better. Diminishing returns. Some formulas and power calculations are in https://plus.google.com/u/0/103530621949492999968/posts/i4RB2DHnW5y
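A tiny illustration of that diminishing-returns point, using nothing but the textbook standard-error formula (not the power calculations linked above):

```python
import math

sd = 15.0                       # hypothetical population standard deviation
for n in (10, 100, 1000):
    se = sd / math.sqrt(n)      # standard error of the mean shrinks like 1/sqrt(n)
    print(n, round(se, 2))
# 100 subjects gives roughly 3x the precision of 10 subjects, not 10x.
```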
Great explanation, but I think you could improve it by putting it within the context of the hierarchy of evidence (i.e., how it should be weighted as evidence), and mentioning its flaws. Often in skeptic circles I saw people using meta-analyses as the nuclear option in arguments with alternative medicine supporters or such—things got awkward when both sides had a meta-analysis in their favor.
Actually, I’m surprised someone hasn’t made a post on how to weight research in general (that probably means someone has).
Hm, I don’t really know of any such explanation; there’s Wikipedia, of course: http://en.wikipedia.org/wiki/Meta-analysis
OK, I’ve edited it heavily. How is it now?
http://i.imgur.com/rOmjZ.gif