Aren’t the control and experimental groups supposed to be as close to identical as possible, and to help analysis identify which subgroups, if any, had effects different from other subgroups?
Ideally, yes, but if you shuffle people around, you’re not necessarily doing yourself any favors. (I think. This seems to be related to an old debate in experimental design going back to Gosset and Fisher over ‘balanced’ versus ‘randomized’ designs, which I don’t understand very well.)
If an intervention showed significantly different results for tall people than for short people, then a follow-up study of that intervention that takes height into account may be indicated.
This is part of the randomized vs balanced design debate. Suppose tall people did better, but you just randomly allocated people; with a small sample of, say, 10 total and 5 in each group, you would expect to wind up with different numbers of tall people in your control and experimental groups (e.g. a 4-1 split of the 5 tall people), and now that may be driving the difference. If you were using a large sample like 5000 people, then you’d expect the random allocation to be very even between the two groups of 2500.
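A quick simulation makes the small-sample point concrete (the `tall_split` helper is hypothetical, written just for this sketch):

```python
import random

def tall_split(n_total, n_tall, trials=10_000, seed=0):
    """Simulate randomly splitting n_total people (n_tall of them tall)
    into two equal groups, and return the average absolute difference
    in tall-people counts between the two groups."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        people = [1] * n_tall + [0] * (n_total - n_tall)  # 1 = tall
        rng.shuffle(people)
        tall_in_a = sum(people[: n_total // 2])
        # tall in group B is n_tall - tall_in_a, so the gap is:
        diffs.append(abs(2 * tall_in_a - n_tall))
    return sum(diffs) / trials

# With 10 people (5 tall), the groups differ by about 1.4 tall people
# on average; with 5000 people the imbalance is tiny relative to the
# group size of 2500.
print(tall_split(10, 5))
print(tall_split(5000, 2500, trials=1_000))
```

With 5 tall people and two groups, a perfectly even split is impossible, and lopsided splits like 4-1 are common; at 5000 total, the same random process leaves the groups balanced to within about 1% of their size.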
If you specify in advance that tall people are a possibility, you can try to ‘balance’ the groups by additional steps: for example, you might randomize short people as usual, but block-randomize pairs of tall people—if heads, the guy on the left is in the experimental and right in control, if tails, the other way around—where by definition you get an even split of tall people (and maybe 1 guy left over). This is fine, sensible, and efficient use of your sample, and if you’re testing additional hypotheses like ‘tall people score better, even on top of the intervention’, you’ll take appropriate measures like increasing your sample size to reach your desired statistical power / alpha parameters. No problems there.
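The pair-flipping scheme can be sketched in a few lines (the function name and the `"T1"`-style IDs are made up for illustration):

```python
import random

def block_randomize_pairs(tall_ids, rng=None):
    """Assign tall people to experimental/control by flipping a coin for
    each consecutive pair: heads, the first goes to experimental and the
    second to control; tails, the other way around. This guarantees an
    even split, up to one leftover person if the count is odd."""
    rng = rng or random.Random(0)
    experimental, control = [], []
    it = iter(tall_ids)
    for left in it:
        right = next(it, None)
        if right is None:  # odd count: one person left over, coin-flip him alone
            (experimental if rng.random() < 0.5 else control).append(left)
            break
        if rng.random() < 0.5:  # heads
            experimental.append(left)
            control.append(right)
        else:                   # tails
            experimental.append(right)
            control.append(left)
    return experimental, control

exp, ctl = block_randomize_pairs(["T1", "T2", "T3", "T4", "T5"])
# group sizes differ by at most 1, whatever the coin flips come up as
```

Each assignment is still random, so the allocation stays unpredictable, but the tall/short composition of the two groups is forced to be nearly identical by construction.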
But any post hoc analysis can be abused. If after you run your study you decide to look at how tall people did, you may have an unbalanced split driving any result, you’re increasing how many hypotheses you’re testing, and so on. Post hoc analyses are untrustworthy and suspicious; here’s an example where a post hoc analysis was done: http://lesswrong.com/lw/68k/nback_news_jaeggi_2011_or_is_there_a/
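The ‘more hypotheses’ problem is easy to quantify: if each null subgroup test has a 5% chance of a spurious ‘significant’ result, running many of them makes at least one false positive likely. A crude simulation (hypothetical function name; assumes the tests are independent):

```python
import random

def false_positive_rate(n_subgroups, alpha=0.05, trials=20_000, seed=1):
    """Estimate the probability that at least one of n_subgroups
    independent tests of a true null hypothesis comes out 'significant'
    at level alpha, by simulating uniform p-values."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(n_subgroups)):
            hits += 1
    return hits / trials

print(false_positive_rate(1))    # roughly 0.05, as advertised
print(false_positive_rate(10))   # roughly 0.40, i.e. 1 - 0.95**10
```

So a post hoc fishing expedition through ten subgroups has around a 40% chance of ‘finding’ something even when the intervention does nothing at all.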
I was saying that if there was any reason to suspect height might be a factor, then height should be added to the factors considered when trying to make the groups indistinguishable from each other. If height isn’t suspected to be a factor, adding height to those factors with a low weight does almost no harm to the rest of the distribution.
Is there any excuse for the measured variable to notably differ between the control and experimental groups in a well-executed experiment?
I was saying that if there was any reason to suspect height might be a factor, then height should be added to the factors considered when trying to make the groups indistinguishable from each other.
In a perfect world, perhaps. But every variable is more effort, and you need to do it from the start or else you might wind up screwing things up (imagine processing people one by one over a few weeks, starting their interventions, and noticing halfway through that height differs between the groups...?)
Is there any excuse for the measured variable to notably differ between the control and experimental groups in a well-executed experiment?
If you didn’t balance them, it may easily happen. And the more variables that describe each person, the more likely the groups will be unbalanced by some variable. People are complex like that. If you’re interested in the topic, I’ve already pointed you at the Wikipedia articles, but you could also check out Ziliak’s papers.
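A sketch of the ‘more variables, more chances for imbalance’ point, under the simplifying (and made-up) assumption that each trait is an independent coin flip across people:

```python
import random

def any_imbalance(n_people=100, n_traits=20, threshold=10,
                  trials=2000, seed=2):
    """Fraction of random half-and-half splits in which at least one of
    n_traits independent coin-flip traits differs between the two halves
    by more than `threshold` trait-carriers. Because traits and the split
    are independent, carrier counts per half are just Binomial draws."""
    rng = random.Random(seed)
    half = n_people // 2
    hits = 0
    for _ in range(trials):
        for _ in range(n_traits):
            a = sum(rng.random() < 0.5 for _ in range(half))
            b = sum(rng.random() < 0.5 for _ in range(half))
            if abs(a - b) > threshold:
                hits += 1  # at least one badly unbalanced trait
                break
        # else: this split was acceptably balanced on every trait
    return hits / trials

print(any_imbalance())  # roughly half of splits have some bad trait
```

Any single trait is unlikely to come out badly unbalanced, but with 20 traits describing each person, around half of random splits are notably unbalanced on at least one of them.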
I see where gathering information about all participants before starting the intervention might not be possible. It should still be possible to maximize balance as each batch is added, but that means a tradeoff between balancing each batch and balancing the experiment as a whole. For a given experiment, we would have to weigh the relative likelihood of a confounding variable in the batches against that of a confounding variable in the demographics.
The undetected confounding variable is always a possibility. That doesn’t mean we can’t or shouldn’t do as much about it as the expected gains justify relative to the expected costs, and doing some really complicated math to divide the sample into two groups isn’t much more expensive than collecting the data to go into it.
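The ‘complicated math’ can be surprisingly simple: a greedy minimization-style allocator assigns each person to whichever group currently leaves the covariate totals most balanced. This is a crude deterministic sketch, not the real Pocock–Simon procedure (which also randomizes each assignment), and all names in it are made up:

```python
def greedy_balance(people):
    """Greedy minimization-style allocation: take people in order and put
    each one in whichever of two groups leaves the covariate totals and
    group sizes most balanced. Each person is a dict of numeric covariates."""
    groups = ([], [])

    def totals(group, extra=None):
        t = {}
        for p in group + ([extra] if extra else []):
            for k, v in p.items():
                t[k] = t.get(k, 0) + v
        return t

    def imbalance(t0, t1, n0, n1):
        keys = set(t0) | set(t1)
        return sum(abs(t0.get(k, 0) - t1.get(k, 0)) for k in keys) + abs(n0 - n1)

    for person in people:
        # score the hypothetical imbalance of adding this person to each group
        score0 = imbalance(totals(groups[0], person), totals(groups[1]),
                           len(groups[0]) + 1, len(groups[1]))
        score1 = imbalance(totals(groups[0]), totals(groups[1], person),
                           len(groups[0]), len(groups[1]) + 1)
        groups[0 if score0 <= score1 else 1].append(person)
    return groups

people = [{"tall": 1, "male": 0}, {"tall": 1, "male": 1},
          {"tall": 0, "male": 1}, {"tall": 0, "male": 0}]
exp_group, ctl_group = greedy_balance(people)
# each group ends up with one tall person and one male person
```

Real minimization schemes assign each person to the imbalance-reducing group only with high probability rather than deterministically, so that the allocation can't be gamed, but the balancing logic is the same.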