We Have a Failure to Replicate



While I love to cast shade in Ronald A. Fisher’s direction, he got this one right.

It is the systematic replication and extension of the results of previous studies, and not p values from individual ones, that fosters cumulative knowledge development. That this statement appears to have eluded many applied researchers, as well as editors and reviewers, is puzzling because Fisher (1966) himself put only provisional stock in statistically significant results from single studies: ‘we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon’ (p. 13).
Fisher was a major proponent of replication: ‘Fisher had reason to emphasize, as a first principle of experimentation, the function of appropriate replication in providing an estimate of error’ (Fisher Box, 1978, p. 142). Indeed, Fisher Box (1978) insinuates that Fisher coined the term ‘replication’: ‘The method adopted was replication, as Fisher called it; by his naming of what was already a common experimental practice, he called attention to its functional importance in experimentation’[1]

Problem is, replication is dull work. You’re doing the exact same thing someone else did, hoping they screwed up somewhere along the line. When that person happens to be an influential senior scientist in your field, things can get awkward. As Ed Yong puts it,

Positive results in psychology can behave like rumours: easy to release but hard to dispel. They dominate most journals, which strive to present new, exciting research. Meanwhile, attempts to replicate those studies, especially when the findings are negative, go unpublished, languishing in personal file drawers or circulating in conversations around the water cooler. “There are some experiments that everyone knows don’t replicate, but this knowledge doesn’t get into the literature,” says Wagenmakers. The publication barrier can be chilling, he adds. “I’ve seen students spending their entire PhD period trying to replicate a phenomenon, failing, and quitting academia because they had nothing to show for their time.”

These problems occur throughout the sciences, but psychology has a number of deeply entrenched cultural norms that exacerbate them.

So it helps if you do your replications as a group, and explicitly make it a priority. The Reproducibility Project aims to do exactly that, at least for psychology, and they’ve been trickling out their results over the last few months. The latest from them is sobering.

Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.

Ouch. And as the studies they tried to replicate were a representative sample of papers published in top journals, that represents a huge waste of effort and suggests big problems of methodology across the entire field.

As a “fan” of p-values, this part had me chuckling.

The density plots of P values for original studies (mean P value = 0.028) and replications (mean P value = 0.302) are shown in Fig. 1, left. The 64 nonsignificant P values for replications were distributed widely. When there is no effect to detect, the null distribution of P values is uniform. This distribution deviated slightly from uniform with positive skew, however, suggesting that at least one replication could be a false negative, χ²(128) = 155.83, P = 0.048. Nonetheless, the wide distribution of P values suggests against insufficient power as the only explanation for failures to replicate.


[…] A negative correlation of replication success with the original study P value indicates that the initial strength of evidence is predictive of reproducibility. For example, 26 of 63 (41%) original studies with P < 0.02 achieved P < 0.05 in the replication, whereas 6 of 23 (26%) that had a P value between 0.02 < P < 0.04 did so (Fig. 2). Almost two thirds (20 of 32, 63%) of original studies with P < 0.001 had a significant P value in the replication.
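
An aside for the curious: that χ²(128) figure looks like Fisher's method for combining p-values, since 128 is exactly twice the 64 nonsignificant replications. The excerpt doesn't name the test, so treat that as my assumption. Under it, the statistic is -2 times the sum of the log p-values, compared against a chi-square distribution with 2k degrees of freedom. Here's a minimal Python sketch; the function names and the three example p-values are mine and purely illustrative, not data from the paper.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the closed form that holds only when df is even:
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # the i = 0 term of the series
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


def fisher_combined(p_values):
    """Fisher's method: returns (statistic, degrees of freedom, combined p)."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    df = 2 * len(p_values)
    return stat, df, chi2_sf_even_df(stat, df)


# Hypothetical nonsignificant p-values, for illustration only.
print(fisher_combined([0.20, 0.45, 0.81]))

# Plugging in the statistic reported in the excerpt (not recomputed from data).
print(chi2_sf_even_df(155.83, 128))
```

If the Fisher's-method guess is right, that last line should land close to the reported P = 0.048. Even then it only says the 64 p-values are not quite uniform as a group; it says nothing about which individual replication might be the false negative.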

Flipping that on its head, if your study had a p-value below 0.001 then, going by this sample, there’s still roughly a 37% chance (12 of 32) that a replication would fail to hit statistical significance. It’s clear to me that p-values have a tenuous link to “significance,” and they should be dropped immediately for the good of science.

But it should be clear to everyone that replications are critical to science, no matter the methodology, and must be encouraged to the greatest extent possible.


[1] Hubbard, Raymond, and R. Murray Lindsay. “Why P values are not a useful measure of evidence in statistical significance testing.” Theory & Psychology 18.1 (2008): 69-88.

