A mentor of mine once told me that replication is useful, but not the most useful thing you could be doing, because it’s often better to do a followup experiment that rests on the premises established by the initial experiment: if the first experiment was wrong, the second experiment will end up wrong too, so the error surfaces anyway. Science should not go even slower than it already does—just update and move on, don’t obsess.
Tell me, does anyone actually do what you think they should do? That is, given a long chain of ideas A->B->C->D predicting Z, none of which have been replicated, upon experimenting and learning ~Z, do they ever reject the bogus theory D? (Or wait, was it C that should be rejected? Or maybe the ~Z should be rejected, since the experiment may just not have been powered well enough to be meaningful, as almost all studies are underpowered. Or can you really say that A...D logically entailed Z? Maybe some other factor interfered with Z, and so we can ‘save the appearances’ of A...D! Yes, that’s definitely it!) “Theory-testing in psychology and physics: a methodological paradox”, Meehl 1967, puts it nicely (and it is as true as the day he wrote it half a century ago):
This last methodological sin is especially tempting in the “soft” fields of (personality and social) psychology, where the profession highly rewards a kind of “cuteness” or “cleverness” in experimental design, such as a hitherto untried method for inducing a desired emotional state, or a particularly “subtle” gimmick for detecting its influence upon behavioral output. The methodological price paid for this highly-valued “cuteness” is, of course, (d) an unusual ease of escape from modus tollens refutation. For, the logical structure of the “cute” component typically involves use of complex and rather dubious auxiliary assumptions, which are required to mediate the original prediction and are therefore readily available as (genuinely) plausible “outs” when the prediction fails. It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc in order to avoid the latter’s modus tollens impact on the theory) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program”, without ever once refuting or corroborating so much as a single strand of the network.
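(The “not powered enough” escape hatch above is cheap because it is usually true. A minimal sketch in Python, using statsmodels with hypothetical but representative numbers, shows how little a single null result means in a typical study:)

```python
# A sketch, not anyone's actual study: the power of a typical psychology
# experiment to detect a modest true effect. All numbers are hypothetical
# but representative: d = 0.3, n = 20 per group, two-sided alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.3, nobs1=20, ratio=1.0, alpha=0.05)
print(f"power = {power:.2f}")
# power ~ 0.14: roughly 6 times out of 7 the study returns a null even
# though the effect is real, so a ~Z refutes very little by itself.
```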
To give a concrete example of why your advice is absurd and impractical and dangerous...
One of the things I am most proud of is my work on dual n-back not increasing IQ; the core researchers, in particular the founder Jaeggi, are well aware that their results have not replicated well and that the results are almost entirely explained by bad control groups, thanks in part to the increased sample size from various followup studies which tried to repeat the finding while doing something else, like an fMRI study or an emotional-processing variant. So, what are they doing now, the Buschkuehl lab and the new Jaeggi lab? Have they abandoned DNB/IQ, reasoning that since “the first experiment was wrong, the second experiment will end up wrong too”? Have they taken your advice to “just update and move on, don’t obsess”? Or taken serious stock of their methods and their other results on the benefits of working-memory training in general?
No. They are now busily investigating whether individual personality differences can explain transfer to IQ, whether other tasks can transfer, whether manipulating motivation can moderate transfer to IQ, and so on and so forth, reaching p<0.05 and publishing papers just as they were before; but I suppose that’s all OK, because after all, “there are so many followup studies which are explained by [dual n-back transferring] really well that it seems a bit silly to throw out the notion of [dual n-back increasing IQ] just because of that”.
Wait, I’m not sure we’re talking about the same thing. I’m saying direct replication isn’t the most useful way to spend time. You’re talking about systematic experiment design flaws.
According to your writing, the failures in this example stem from methodological issues (not using an active control group). A direct replication of the n-back-IQ transfer would have just hit p<.05 again, as it would have had the same methodological issues. Of course, if the methodological issue is not repaired, all subsequent findings will suffer from the same issues.
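(A quick simulation makes this concrete; the numbers are assumptions for illustration, not estimates from the n-back literature: zero true training effect, plus an expectancy/retest artifact of d = 0.6 that a passive control group fails to subtract.)

```python
# Sketch: when the design manufactures the effect, direct replications of
# that same design happily "confirm" it. Assumed numbers: zero true
# training effect; a passive (no-contact) control group lets an
# expectancy/retest artifact of d = 0.6 accrue to the trained group only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def passive_control_study(n=40, artifact=0.6):
    """One study: IQ gain scores; the trained group gets only the artifact."""
    trained = rng.normal(loc=artifact, scale=1.0, size=n)
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    return stats.ttest_ind(trained, control).pvalue

pvals = [passive_control_study() for _ in range(2000)]
print(f"'successful' replications: {np.mean(np.array(pvals) < 0.05):.0%}")
# roughly 75% of exact replications hit p < .05 despite a zero true effect
```

Running the same flawed design again just re-measures the artifact; only a design change, such as an active control group, can expose it.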
I’m strictly saying that direct replication isn’t useful. Rigorously checking the methods, and redoing the experiment correctly where there is a failure in the documented methodology, is always a good idea.
But the Jaeggi cluster also sometimes uses active control groups, with various kinds of differences in the intervention, metrics, and interpretations. In fact, Jaeggi was co-author on a new dual n-back meta-analysis released this month*; the meta-analysis finds the same passive-active difference I did, and you know what their interpretation is? That it’s due to the correlated classification of US vs international laboratories conducting particular experiments. (It never even occurred to me to classify the studies this way.) They note that psychology experiments sometimes reach different conclusions in other cultures/countries—which they do—and so perhaps American studies using active control groups show lower results because Americans gain less from n-back training. The kindest thing I can say about this claim is that I may be able to falsify it with my larger collection of studies (they threw out or missed a lot).
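Nothing in their data can adjudicate between two perfectly confounded moderators; a toy subgroup analysis (all effect sizes invented for illustration, not taken from the actual meta-analysis) shows the symmetry:

```python
# Sketch: two perfectly confounded moderators fit the data identically.
# Invented effect sizes (Hedges' g); in this toy setup every passive-control
# study is also international, so "control type" and "US vs. international"
# make exactly the same predictions and the data cannot pick between them.
import numpy as np

studies = [  # (g, control_type, location) -- all values hypothetical
    (0.55, "passive", "intl"),
    (0.48, "passive", "intl"),
    (0.60, "passive", "intl"),
    (0.10, "active", "us"),
    (0.05, "active", "us"),
    (0.12, "active", "us"),
]

def subgroup_means(select):
    groups = {}
    for g, ctrl, loc in studies:
        groups.setdefault(select(ctrl, loc), []).append(g)
    return {k: round(float(np.mean(v)), 2) for k, v in groups.items()}

print("split by control type:", subgroup_means(lambda ctrl, loc: ctrl))
print("split by location:    ", subgroup_means(lambda ctrl, loc: loc))
# Both splits yield ~0.54 vs ~0.09: identical fits, opposite causal stories.
```

Only studies that break the confound (American labs running passive controls, international labs running active ones) could separate the two stories, which is one reason the studies they threw out or missed matter.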
So, after performing these conceptual extensions of their results—as you suggest—they continue to
...slowly wend [their] way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program”, without ever once refuting or corroborating so much as a single strand of the network.
So it goes.
* http://www.gwern.net/docs/dnb/2014-au.pdf / https://pdf.yt/d/VMPWmd0jpDYvZIjm / https://dl.dropboxusercontent.com/u/85192141/2014-au.pdf ; initial comments on it: https://groups.google.com/forum/#!topic/brain-training/GYqqSyfqffA
The first sentence in your dual-n-back article is:
If you believe that there’s a net gain of medium effect size, then why do you think we should throw dual n-back under the bus?
You should probably have read part of the second sentence: “active vs passive control groups criticism: found, and it accounts for most of the net effect size”.