Really interesting analysis of social science papers and replication markets. Some excerpts:
Over the past year, I have skimmed through 2578 social science papers, spending about 2.5 minutes on each one. This was due to my participation in Replication Markets, a part of DARPA’s SCORE program, whose goal is to evaluate the reliability of social science research. 3000 studies were split up into 10 rounds of ~300 studies each. Starting in August 2019, each round consisted of one week of surveys followed by two weeks of market trading. I finished in first place in 3 out of 10 survey rounds and 6 out of 10 market rounds. In total, about $200,000 in prize money will be awarded.
The studies were sourced from all social sciences disciplines (economics, psychology, sociology, management, etc.) and were published between 2009 and 2018 (in other words, most of the sample came from the post-replication crisis era).
The average replication probability in the market was 54%; while the replication results are not out yet (175 of the 3000 papers will be replicated), previous experiments have shown that prediction markets work well.
This is what the distribution of my own predictions looks like:
[...]
Check out this crazy chart from Yang et al. (2020):
Yes, you’re reading that right: studies that replicate are cited at the same rate as studies that do not. Publishing your own weak papers is one thing, but citing other people’s weak papers? This seemed implausible, so I decided to do my own analysis with a sample of 250 articles from the Replication Markets project. The correlation between citations per year and (market-estimated) probability of replication was −0.05!
You might hypothesize that the citations of non-replicating papers are negative, but negative citations are extremely rare. One study puts the rate at 2.4%. Astonishingly, even after retraction the vast majority of citations are positive, and those positive citations continue for decades after retraction.
As in all affairs of man, it once again comes down to Hanlon’s Razor. Either:
Malice: they know which results are likely false but cite them anyway.
or, Stupidity: they can’t tell which papers will replicate even though it’s quite easy.
Accepting the first option would require a level of cynicism that even I struggle to muster. But the alternative doesn’t seem much better: how can they not know? I, an idiot with no relevant credentials or knowledge, can fairly accurately determine good research from bad, but all the tenured experts can not? How can they not tell which papers are retracted?
I think the most plausible explanation is that scientists don’t read the papers they cite, which I suppose involves both malice and stupidity. Gwern has an interesting write-up on this question, citing some ingenious bibliographic analyses: “Simkin & Roychowdhury venture a guess that as many as 80% of authors citing a paper have not actually read the original”. Once a paper is out there nobody bothers to check it, even though they know there’s a 50-50 chance it’s false!
Having read the original article, I was surprised at how long it was (compared to the brief excerpts), and how scathing it was, and how funny it was <3
A typical paper doesn’t just contain factual claims about standard questions, but also theoretical discussion and a point of view on the ideas that form the fabric of a field. Papers are often referenced to clarify the meaning of a theoretical discussion, or to give credit for inspiring the direction in which the discussion moves. This aspect doesn’t significantly depend on truth of findings of particular studies, because an interesting concept motivates many studies that both experimentally investigate and theoretically discuss it. Some of the studies will be factually bogus, but the theoretical discussion in them might still be relevant to the concept, and useful for subsequent good studies.
So a classification of citations into positive and negative ignores this important third category, something like conceptual reference citation.
Maybe we need a yearly award for the scientist who cites the most retracted papers?
I appreciated the analysis of what does and doesn’t replicate, but the author has clearly never been in academia and many of their recommendations are off base. Put another way, the “what’s wrong with social science” part is great, and the “how to fix it” is not.
Which specific parts did you have in mind?
My claims are really just for CS, idk how much they apply to the social sciences, but the post gives me no reason to think they aren’t true for the social sciences as well.
This doesn’t work unless it’s common knowledge that the research is bad, since reviewers are looking for reasons to reject and “you didn’t cite this related work” is a classic one (and your paper might be reviewed by the author of the bad work). When I was early in my PhD, I had a paper rejected where it sounded like a major contributing factor was not citing a paper that I specifically thought was not related but the reviewer thought was.
I think the point of this recommendation is to get people to stop citing bad research. I doubt it will make a difference since as argued above the cause isn’t “we can’t tell which research is bad” but “despite knowing what’s bad we have to cite it anyway”.
I have issues with this, but they aren’t related to me knowing more about academia than the author, so I’ll skip it. (And it’s more like, I’m uncertain about how good an idea this would be.)
The evidence in the post suggesting that people aren’t acting in good faith is roughly “if you know statistics then it’s obvious that the papers you’re writing won’t replicate”. My guess is that many social scientists don’t know statistics and/or don’t apply it intuitively, so I don’t see a reason to reject the (a priori more plausible to me) hypothesis that most people are acting in okay-to-good faith.
I don’t really understand the author’s model here, but my guess is that they are assuming that academics primarily think about “here’s the dataset and here are the analysis results and here are the conclusions”. I can’t speak to social science, but when I’m trying to figure out some complicated thing (e.g. “why does my algorithm work in setting X but not setting Y”) I spend most of my time staring at data, generating hypotheses, making predictions with them, etc., which is very, very conducive to the garden of forking paths that the author dismisses out of hand.
EDIT: Added some discussion of the other recommendations below, though I know much less about them, and here I’m just relying more on my own intuition rather than my knowledge about academia:
I’d be shocked if 3⁄4 of social science papers could have been preregistered. My guess is that what happens is that researchers collect data, do a bunch of analyses, figure out some hypotheses, and only then write the paper.
Possibly the suggestion here is that all this exploratory work should be done first, then a study should be preregistered, and then the results are reported. My weak guess is that this wouldn’t actually help replicability very much—my understanding is that researchers are often able to replicate their own results, even when others can’t. (Which makes sense! If I try to describe to a CHAI intern an algorithm they should try running, I often have the experience that they do something differently than I was expecting. Ideally in social science results would be robust to small variations, but in practice they aren’t, and I wouldn’t strongly expect preregistration to help, though plausibly it would.)
My general qualms about preregistration apply here too, but if we assume that we’re going to have a preregistration model, then this seems good to me.
This seems good to me (though idk if 10% is the right number, I could see both higher and lower).
Personally, I don’t like the idea of significance thresholds and required sample sizes. I like having quantitative data because it informs my intuitions; I can’t just specify a hard decision rule based on how some quantitative data will play out.
Even if this were implemented, I wouldn’t predict much effect on reproducibility: I expect that what happens is the papers we get have even more contingent effects that only the original researchers can reproduce, which happens via them traversing the garden of forking paths even more. Here’s an example with p-values of .002 and .006.
Andrew Gelman makes a similar case.
I am very on board with citation counts being terrible, but what should be used instead? If you evaluate based on predicted replicability, you incentivize research that says obvious things, e.g. “rain is correlated with wet sidewalks”.
I suspect that you probably could build a better and still cost-efficient evaluation tool, but it’s not obvious how.
Seems good, though I’d want to first understand what purpose IRBs serve (you’d have to severely roll back IRBs for open data to become a norm).
I approve of the goal “minimize fraud”. This recommendation is too vague for me to comment on the strategy.
This seems to assume that the NSF would be more competent than journals for some reason. I don’t think the problem is with journals per se, I think the problem is with peer review, so if the NSF continues to use peer review as the author suggests, I don’t expect this to fix anything.
The author also suggests using a replication prediction market; as I mentioned above you don’t want to optimize just for replicability. Possibly you could have replication + some method of incentivizing novelty / importance. The author does note this issue elsewhere but just says “it’s a solvable problem”. I am not so optimistic. I feel like similar a priori reasoning could have led to the author saying “reproducibility will be a solvable problem”.
Citation count clearly isn’t a good measure of accuracy, but it’s likely a good measure of importance in a field. So we could run some kind of expected value calculation where the usefulness of a paper is measured by
P(result is true) * (# of citations) - P(result is false) * (# of citations) = (# of citations) * [P(result is true) - P(result is false)].
Edit: where the probabilities are approximated by replication markets. I think this function gives us what we actually want, so optimizing institutions to maximize it seems like a good idea.
Edit: This doesn’t actually represent what we want, since journals can just force everyone to cite the same well-replicated study to maximize citation count on that, but it’s a good approximation. In other words, it is a good measurement of what we want but not a good goal, so we shouldn’t optimize institutions to maximize it.
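For concreteness, here is a minimal Python sketch of the expected-value score described above. It assumes P(result is false) = 1 - P(result is true) and that the probability comes from a replication market; the function name, paper labels, citation counts, and probabilities are all made up for illustration.

```python
def expected_value_score(citations: float, p_true: float) -> float:
    """Return (# of citations) * [P(result is true) - P(result is false)].

    Assumes P(result is false) = 1 - P(result is true).
    """
    if not 0.0 <= p_true <= 1.0:
        raise ValueError("p_true must be a probability in [0, 1]")
    return citations * (p_true - (1.0 - p_true))


# Hypothetical papers: (citation count, market-estimated replication probability)
papers = {
    "well_cited_solid": (200, 0.9),
    "well_cited_shaky": (200, 0.3),
    "obscure_solid": (10, 0.9),
}

scores = {name: expected_value_score(c, p) for name, (c, p) in papers.items()}

for name, score in scores.items():
    print(f"{name}: {score:.1f}")
```

With these toy numbers the well-cited paper that probably won't replicate gets a negative score (about -80), while the equally cited solid paper scores about 160 and the obscure solid paper about 8. Note that any paper at exactly 50% scores zero regardless of how often it is cited.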
Lesswrong upvote count.
Slightly more seriously: Propagation through the academic segments of centerless curation networks. The author might be anticipating continued advances in social media technology, conventions of use, and uptake. Uptake and improvements in conventions of use, at least, seem to be visibly occurring. Advances in technology seem less assured, but I will do what I can.
This problem seems to me to have the flavor of Moloch and/or inadequate equilibria. Your criticisms have two parts, the pre-edit part based on your personal experience, in which you state why the personal actions they recommend are actually not possible because of the inadequate equilibria (i.e. because of academic incentives), and the criticism of the author’s proposed non-personal actions, which you say is just based on intuition.
I think the author would be unsurprised that the personal actions are not reasonable. They have already said this problem requires government intervention, basically to resolve the incentive problem. But maybe at the margin you can take some of the actions that the author refers to in the personal actions. If a paper is on the cusp of “needing to be cited” but you think it won’t replicate, take that into account! Or if reviewing a paper, at least take into account the probability of replication in your decision.
I think you are maybe reading the author’s claim to “stop assuming good faith” too literally. In the subsequent sentence they are basically refining that to the idea that most people are acting in good faith, but are not competent enough for good faith to be a useful assumption, which seems reasonable to me.
Why do you think people don’t already do this?
In general, if you want to make a recommendation on the margin, you have to talk about what the current margin is.
Huh? The sentence I see is
“the predators are running wild” does not mean “most people are acting in good faith, but are not competent enough for good faith to be a useful assumption”.
They have to do it to some extent, otherwise replicability would be literally uncorrelated with publishability, which probably isn’t the case. But because of the outcomes, we can see that people aren’t doing it enough at the margin, so encouraging people to move as far in that direction as they can seems like a useful reminder.
There are two models here, one is that everyone is a homo economicus when citing papers, so no amount of persuasion is going to adjust people’s citations. They are already making the optimal tradeoff based on their utility function of their personal interests vs. society’s interests. The other is that people are subject to biases and blind spots, or just haven’t even really considered whether they have the OPTION of not citing something that is questionable, in which case reminding them is a useful affordance.
I’m trying to be charitable to the author here, to recover useful advice. They didn’t say things in the way I’m saying them. But they may have been pointing in a useful direction, and I’m trying to steelman that.
Even upon careful rereading of that sentence, I disagree. But to parse this out based on this little sentence is too pointless for me. Like I said, I’m trying to focus on finding useful substance, not nitpicking the author, or you!
What are some of the recommendations that seem most off base to you?
Replied to John below
Is a two-thirds replication rate necessarily bad? This is an honest question, since I don’t know what the optimal replication rate would be. Seems worth noting that a) a 100% replication rate seems too high, since it would indicate that people were only doing boring experiments that were certain to replicate, and b) “replication rate” seems to mean “does the first replication attempt succeed”, and some fraction of replication attempts will fail due to random chance even if the effect is genuine.
I think there’s an idea that a paper with a p=0.05 finding should replicate 95% of the time. If it doesn’t then the p-value was wrong.
That’s not really what a p-value means though, right? The actual replication rate should depend on the prior and the power of the studies.
I don’t think a high replication rate necessarily implies the experiments were boring. Suppose you do 10 experiments, but they’re all speculative and unlikely to be true: let’s say only one of them is looking at a true effect, BUT your sample sizes are enormous and you have a low significance cutoff. So you detect the one effect and get 9 nulls on the others. When people try to replicate them, they have a 100% success rate on both the positive and the negative results.
The fraction of attempts that will fail due to random chance depends on the power, and replicators tend to go for very high levels of power, so typically you’d have about 5% false negatives or so in the replications.
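To make the dependence on prior and power concrete, here is a rough Python sketch. It uses the standard Bayes calculation for the share of significant findings that are true effects, then assumes a replication "succeeds" when it is significant at `rep_alpha` with power `rep_power`. The function names and all the numbers are illustrative assumptions, and it only covers replications of positive findings, not of nulls.

```python
def share_of_true_positives(prior: float, power: float, alpha: float) -> float:
    """P(effect is real | original study was significant), via Bayes' rule."""
    true_pos = prior * power          # real effects that get detected
    false_pos = (1.0 - prior) * alpha  # null effects that pass the cutoff
    return true_pos / (true_pos + false_pos)


def expected_replication_rate(
    prior: float,
    power: float,
    alpha: float,
    rep_power: float = 0.95,  # assumed power of the replication study
    rep_alpha: float = 0.05,  # assumed significance cutoff of the replication
) -> float:
    """Expected fraction of significant findings that replicate."""
    p_real = share_of_true_positives(prior, power, alpha)
    # Real effects replicate with probability rep_power;
    # false positives "replicate" by chance with probability rep_alpha.
    return p_real * rep_power + (1.0 - p_real) * rep_alpha


# Speculative hypotheses, modest power, conventional cutoff.
low = expected_replication_rate(prior=0.1, power=0.5, alpha=0.05)

# Same speculative hypotheses, but enormous samples and a strict cutoff.
high = expected_replication_rate(prior=0.1, power=1.0, alpha=0.001)

print(f"modest power, alpha=0.05:  {low:.2f}")
print(f"huge samples, alpha=0.001: {high:.2f}")
```

Under these assumptions the first scenario replicates only about half the time even though every original result had p < 0.05, while the second replicates at close to the replication study's own power, even though the hypotheses were just as speculative.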
Indeed. Reading an abstract and skimming intro/discussion is as far as it goes in most cases. Sometimes it’s just the title that is enough to trigger a citation. Often it’s “reciting”, copying the references from someone else’s paper on the topic. My guess is that maybe 5% of references in a given paper have actually been read by the authors.
Andrew Gelman’s take here.
I think there is an important (and obvious) third alternative to the two options presented at the end (of the snippet, rather early in the full piece), namely that many scientists are not very interested in the truth value of the papers they cite. This is neither malice nor stupidity. There is simply no mechanism to punish scientists who cite bad science (and it is not clear there should be, in my opinion). If a paper passes the initial hurdle of peer review it is officially Good Enough to be cited as well, even if it is later retracted (or, put differently, “I’m not responsible for the mistakes the people I cited make, the review committee should have caught it!”).
If you’re a scientist, your job is ostensibly to uncover the truth about your field of study, so I think being uninterested in the truth of the papers you cite is at least a little bit malicious.
Certainly, but it’s not malicious in the sense of deliberately citing bad science. More like negligence.