The examples in the post where authors disagreed heavily about the sign of the effect (school → pregnancy and immigration → social policy support) are both questions where I’d expect, a priori, to find small-and-inconsistent effect sizes. And if we ignore “statistical significance” and look at effect sizes in the graphs, it indeed looks like almost all the researchers on those questions found tiny effects—roughly ±0.02 for the first and ±0.05 for the second. (Unclear what the units are on those, so maybe I’m wrong about the effects being small, but I’m guessing it’s some kind of standardized effect size.) The statistical significance or sign of the effect isn’t all that relevant—the important part is that almost all researchers agree the effect is tiny.
On the flip side, for the soccer study, the effect sizes are reasonably large. Assuming I’m reading that graph right, the large majority of researchers find that dark-skinned players are ~30% more likely to get a red card. There’s still a wide range of estimates, but the researchers mostly agree that the effect is large, and they mostly agree on the direction.
So I don’t think it really makes sense to interpret these as “many results”. The takeaway is not “analysis has too many degrees of freedom for results to replicate”. The takeaway is “statistical significance by itself sucks, look at effect sizes”. It’s effect sizes which reproduce, and it’s effect sizes which matter anyway for most practical purposes (as opposed to just getting papers published). For the most part, the teams only disagree on whether numbers which are basically 0 are +0 or −0.
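To make that concrete, here’s a toy simulation (my own made-up numbers, nothing from the post’s data): with a tiny true standardized effect, repeated studies all estimate an effect near zero while the p-values (and occasionally the sign) bounce around; with a large effect, every study agrees on both size and direction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5_000  # per group

def one_study(true_d):
    # Simulate one study; return the estimated Cohen's d and the t-test p-value.
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (b.mean() - a.mean()) / pooled_sd, stats.ttest_ind(b, a).pvalue

for true_d in (0.02, 0.3):  # "tiny" vs. roughly soccer-sized
    print(f"true d = {true_d}:")
    for _ in range(5):
        d_hat, p = one_study(true_d)
        print(f"  estimated d = {d_hat:+.3f}, p = {p:.2g}")
# Typical pattern: for d = 0.02 every estimate sits within a few hundredths of
# zero while significance (and sometimes the sign) flips from run to run; for
# d = 0.3 every run agrees the effect is large and positive.
```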
Are the methods of analysis considered part of the “methodological technology” this thread of research considers incomplete?
If so, the whole thing sort of trivializes to “statistics suck, and therefore science methodologically sucks.” Alternatively: how difficult/expensive would it be to run a series of these with the analytical methods specified the same way the hypothesis and data sources were? One group does effect sizes instead of significance, another does likelihood functions instead of significance, etc.
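For the likelihood-function group, here’s a rough sketch of what the deliverable might look like (the red-card counts below are hypothetical, just to show the shape of the output): report the whole likelihood curve for the effect parameter, e.g. a profile likelihood for the red-card risk ratio and a 1/8-likelihood interval, instead of a p-value.

```python
import numpy as np
from scipy.optimize import minimize_scalar

cards_a, games_a = 30, 1000   # hypothetical counts, group A
cards_b, games_b = 39, 1000   # hypothetical counts, group B

def profile_loglik(log_rr):
    # Log-likelihood of a given log risk-ratio, maximizing over the baseline rate.
    rr = np.exp(log_rr)
    def neg_ll(p_a):
        p_b = p_a * rr
        return -(cards_a * np.log(p_a) + (games_a - cards_a) * np.log1p(-p_a)
                 + cards_b * np.log(p_b) + (games_b - cards_b) * np.log1p(-p_b))
    res = minimize_scalar(neg_ll, bounds=(1e-6, min(1.0, 1.0 / rr) - 1e-6),
                          method="bounded")
    return -res.fun

grid = np.linspace(np.log(0.5), np.log(2.5), 201)   # candidate risk ratios
ll = np.array([profile_loglik(x) for x in grid])
rel_lik = np.exp(ll - ll.max())                     # likelihood relative to the MLE
mle_rr = np.exp(grid[ll.argmax()])
interval = np.exp(grid[rel_lik >= 1 / 8])           # 1/8-likelihood interval
print(f"MLE risk ratio ~ {mle_rr:.2f}; "
      f"1/8-likelihood interval ~ [{interval.min():.2f}, {interval.max():.2f}]")
```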
I keep updating in favor of a specialization-of-labor theory for reorganizing science. First order of business: adding analysts to create a Theory/Experiment/Analysis trifecta.